Urllib urlopen/urlretrieve too many open files error - python

Problem
I'm trying to download >100.000 files from a ftp server in parallel (using threads). I previously tried it with urlretrieve as answered here, however this gave me the following error: URLError(OSError(24, 'Too many open files')). Apparently this problem is a bug (cannot find the reference anymore), so I tried to use urlopen in combination with shutil and then write it to file which I could close myself, as described here. This seemed to work fine, but then I got the same error again: URLError(OSError(24, 'Too many open files')). I thought whenever writing to a file is incomplete or will fail the with statement will cause to file to close itself, but seemingly the files still keep open and will eventually cause the script to halt.
Question
How can I prevent this error, i.e. make sure that every files get closed?
Code
import csv
import urllib.request
import shutil
from multiprocessing.dummy import Pool
def url_to_filename(url):
filename = 'patric_genomes/' + url.split('/')[-1]
return filename
def download(url):
url = url.strip()
try:
with urllib.request.urlopen(url) as response, open(url_to_filename(url), 'wb') as out_file:
shutil.copyfileobj(response, out_file)
except Exception as e:
return None, e
def build_urls(id_list):
base_url = 'ftp://some_ftp_server/'
urls = []
for some_id in id_list:
url = base_url + some_id + '/' + some_id + '.fna'
print(url)
urls.append(url)
return urls
if __name__ == "__main__":
with open('full_data/genome_ids.txt') as inFile:
reader = csv.DictReader(inFile, delimiter = '\t')
ids = {row['some_id'] for row in reader}
urls = build_urls(ids)
p = Pool(100)
print(p.map(download, urls))

You may try to use contextlib to close your file as such:
import contextlib
[ ... ]
with contextlib.closing(urllib.request.urlopen(url)) as response, open(url_to_filename(url), 'wb') as out_file:
shutil.copyfileobj(response, out_file)
[ ... ]
According to the docs:
contextlib.closing(thing)
Return a context manager that closes thing upon completion of the block. [ ... ] without needing to explicitly close page. Even if an error occurs, page.close() will be called when the with block is exited.
*** A workaround would be raising the open files limit on your Linux OS. Check your current open files limit:
ulimit -Hn
Add the following line in your /etc/sysctl.conf file:
fs.file-max = <number>
Where <number> is the new upper limit of open files you want to set.
Close and save the file.
sysctl -p
So that changes take effect.

I believe that file handlers you create are not disposed in time by the system, as it takes some time to close connection. So you end up using all the free file handlers (and that includes network sockets) very quickly.
What you do is setting up FTP connection for each of your files. This is a bad practice. A better way is opening 5-15 connections and reusing them, downloading the files through existing sockets, without the overhead of initial FTP handshaking for each file. See this post for reference.
P.S. Also, as #Tarun_Lalwani mentioned, it is not a good idea to create a folder with more than a ~1000 files in it, as it slows down the file system.

How can I prevent this erorr, i.e. make sure that every files get closed?
To prevent the error you need to either increase open file limit, or, which is more reasonable, decrease concurrency in your thread pool. Connection and file closing is done by the context manager properly.
Your thread pool has 100 threads and opens at least 200 handles (one for FTP connection and another for file). Reasonable concurrency would be about 10-30 threads.
Here's simplified reproduction which shows that the code is okay. Put some content in somefile in current directory.
test.py
#!/usr/bin/env python3
import sys
import shutil
import logging
from pathlib import Path
from urllib.request import urlopen
from multiprocessing.dummy import Pool as ThreadPool
def download(id):
ftp_url = sys.argv[1]
filename = Path(__name__).parent / 'files'
try:
with urlopen(ftp_url) as src, (filename / id).open('wb') as dst:
shutil.copyfileobj(src, dst)
except Exception as e:
logging.exception('Download error')
if __name__ == '__main__':
with ThreadPool(10) as p:
p.map(download, (str(i).zfill(4) for i in range(1000)))
And then in the same directory:
$ docker run --name=ftp-test -d -e FTP_USER=user -e FTP_PASSWORD=pass \
-v `pwd`/somefile:/srv/dir/somefile panubo/vsftpd vsftpd /etc/vsftpd.conf
$ IP=`docker inspect --format '{{ .NetworkSettings.IPAddress }}' ftp-test`
$ curl ftp://user:pass#$IP/dir/somefile
$ python3 client.py ftp://user:pass#$IP/dir/somefile
$ docker stop ftp-test && docker rm -v ftp-test

Related

Creating view in browser functionality with python

I have been struggling with this problem for a while but can't seem to find a solution for it. The situation is that I need to open a file in browser and after the user closes the file the file is removed from their machine. All I have is the binary data for that file. If it matters, the binary data comes from Google Storage using the download_as_string method.
After doing some research I found that the tempfile module would suit my needs, but I can't get the tempfile to open in browser because the file only exists in memory and not on the disk. Any suggestions on how to solve this?
This is my code so far:
import tempfile
import webbrowser
# grabbing binary data earlier on
temp = tempfile.NamedTemporaryFile()
temp.name = "example.pdf"
temp.write(binary_data_obj)
temp.close()
webbrowser.open('file://' + os.path.realpath(temp.name))
When this is run, my computer gives me an error that says that the file cannot be opened since it is empty. I am on a Mac and am using Chrome if that is relevant.
You could try using a temporary directory instead:
import os
import tempfile
import webbrowser
# I used an existing pdf I had laying around as sample data
with open('c.pdf', 'rb') as fh:
data = fh.read()
# Gives a temporary directory you have write permissions to.
# The directory and files within will be deleted when the with context exits.
with tempfile.TemporaryDirectory() as temp_dir:
temp_file_path = os.path.join(temp_dir, 'example.pdf')
# write a normal file within the temp directory
with open(temp_file_path, 'wb+') as fh:
fh.write(data)
webbrowser.open('file://' + temp_file_path)
This worked for me on Mac OS.

ftplib - Python: script hangs on downloading big files

Backstory is im trying to pull some data from an ftp login I was given. This data constantly gets updated, about daily, and I believe they wipe the ftp at the end of each week or month. I was thinking about inputting a date and having the script run daily to see if there are any files that match the date, but if the servers time isn't accurate it could cause data loss. For now I just want to download ALL the files, and then ill work on fine-tuning it.
I haven't worked much with coding ftp before, but seems simple enough. However, the problem I'm having is small files get downloaded without a problem and their file sizes check out and match. When it tries to download a big file that would normally take a few minutes, it gets to a certain point (almost completing the file) and then it just stops and the script hangs.
For Example:
It tries to download a file that is 373485927 bytes in size. The script runs and downloads that file up until 373485568 bytes. It ALWAYS stops at this amount after trying different methods and changing some code.
Don't understand why it always stops at this byte and why it would work fine with smaller files (1000 bytes and under).
import os
import sys
import base64
import ftplib
def get_files(ftp, filelist):
for f in filelist:
try:
print "Downloading file " + f + "\n"
local_file = os.path.join('.', f)
file = open(local_file, "wb")
ftp.retrbinary('RETR ' + f, file.write)
except ftplib.all_errors, e:
print str(e)
file.close()
ftp.quit()
def list_files(ftp):
print "Getting directory listing...\n"
ftp.dir()
filelist = ftp.nlst()
#determine new files to DL, pass to get_files()
#for now we will download all each execute
get_files(ftp, filelist)
def get_conn(host,user,passwd):
ftp = ftplib.FTP()
try:
print "\nConnecting to " + host + "...\n"
ftp.connect(host, 21)
except ftplib.all_errors, e:
print str(e)
try:
print "Logging in...\n"
ftp.login(user, base64.b64decode(passwd))
except ftplib.all_errors, e:
print str(e)
ftp.set_pasv(True)
list_files(ftp)
def main():
host = "host.domain.com"
user = "admin"
passwd = "base64passwd"
get_conn(host,user,passwd)
if __name__ == '__main__':
main()
Output looks like this with file dddd.tar.gz being the big one and never finishes it.
Downloading file aaaa.del.gz
Downloading file bbbb.del.gz
Downloading file cccc.del.gz
Downloading file dddd.tar.gz
This could be caused by a timeout issue, perhaps try in:
def get_conn(host,user,passwd):
ftp = ftplib.FTP()
add in larger timeouts until you have more of an idea whats going on, like:
def get_conn(host,user,passwd):
ftp = ftplib.FTP(timeout=100)
I'm not sure if ftplib defaults to a timeout or not, it would be worth checking and worth checking if you are being timed-out from the server. Hope this helps.
If you are running your scrpit in windows cmd console, try to disable the "QuickEdit Mode" option of cmd.
I had encontered a problem that my ftp script hangs running in windows, but works normally in linux. At last i found that solution is working for me.
Ref:enter link description here

Write a file over network in odoo with authentication

I have a Odoo8 running on my linux server and I need to copy a file from this server to a Windows 10 shared folder with authentication.
I tried to do it programmatically like this:
full_path = "smb://hostname/shared_folder/other_path"
if not os.path.exists(full_path):
os.makedirs(full_path)
full_path = os.path.join(full_path, file_name)
bin_value = stream.decode('base64')
if not os.path.exists(full_path):
try:
with open(full_path, 'wb') as fp:
fp.write(bin_value)
fp.close()
return True
except IOError:
_logger.exception("stream_save writing %s", full_path)
but even if no exception is raised, folders are not created and file is not written.
Then I tried to remove the "smb:" part from the uri and it raised an exception regarding authentication.
I'd like to fix the problem just by using python, possibly avoiding os.system calls or external scripts, but if no other way is possible, then any suggestion is welcome.
I also tried with
"//user:password#hostname"
and
"//domain;user:password#hostname"
both with and without smb
Well, I found it out by myself a way using SAMBA:
First you need to install pysmb (pip install pysmb) then:
from smb.SMBConnection import SMBConnection
conn = SMBConnection(user, password, "my_name", server, domain=domain, use_ntlm_v2 = True)
conn.connect(ip_server)
conn.createDirectory(shared_folder, sub_directory)
file_obj = open(local_path_file,'rb')
conn.storeFile(shared_folder, sub_directory+"/"+filename, file_obj)
file_obj.close()
in my case sub_directory is a whole path, thus I need to create each folder one by one (createDirectory works only this way) and everytime I need to check if the directory does not already exists because otherwise createDirectory raise an exception.
I hope my solution could be useful for others.
If anybody find a better solution, please answer...

urlretrieve hangs when downloading file

I have a very simple script that uses urllib to retrieve a zip file and place it on my desktop. The zip file is only a couple MB in size and doesn't take long to download. However, the script doesn't seem to finish, it just hangs. Is there a way to forcibly close the urlretrieve?...or a better solution?
The URL is to a public ftp size. Is the ftp perhaps the cause?
I'm using python 2.7.8.
url = r'ftp://ftp.ngs.noaa.gov/pub/DS_ARCHIVE/ShapeFiles/IA.ZIP'
zip_path = r'C:\Users\***\Desktop\ngs.zip'
urllib.urlretrieve(url, zip_path)
Thanks in advance!
---Edit---
Was able to use ftplib to accomplish the task...
import os
from ftplib import FTP
import zipfile
ftp_site = 'ftp.ngs.noaa.gov'
ftp_file = 'IA.ZIP'
download_folder = '//folder to place file'
download_file = 'name of file'
download_path = os.path.join(download_folder, download_file)
# Download file from ftp
ftp = FTP(ftp_site)
ftp.login()
ftp.cwd('pub/DS_ARCHIVE/ShapeFiles') #change directory
ftp.retrlines('LIST') #show me the files located in directory
download = open(download_path, 'wb')
ftp.retrbinary('RETR ' + ftp_file, download.write)
ftp.quit()
download.close()
# Unzip if .zip file is downloaded
with zipfile.ZipFile(download_path, "r") as z:
z.extractall(download_folder)
urllib has a very bad support for error catching and debugging. urllib2 is a much better choice. The urlretrieve equivalent in urllib2 is:
resp = urllib2.urlopen(im_url)
with open(sav_name, 'wb') as f:
f.write(resp.read())
And the errors to catch are:
urllib2.URLError, urllib2.HTTPError, httplib.HTTPException
And you can also catch socket.error in case that the network is down.
You can use python requests library with requests-ftp module. It provides easier API and better processes exceptions. See: https://pypi.python.org/pypi/requests-ftp and http://docs.python-requests.org/en/latest/

Delete an uploaded file after downloading it from Flask

I am currently working on a small web interface which allows different users to upload files, convert the files they have uploaded, and download the converted files. The details of the conversion are not important for my question.
I am currently using flask-uploads to manage the uploaded files, and I am storing them in the file system. Once a user uploads and converts a file, there are all sorts of pretty buttons to delete the file, so that the uploads folder doesn't fill up.
I don't think this is ideal. What I really want is for the files to be deleted right after they are downloaded. I would settle for the files being deleted when the session ends.
I've spent some time trying to figure out how to do this, but I have yet to succeed. It doesn't seem like an uncommon problem, so I figure there must be some solution out there that I am missing. Does anyone have a solution?
There are several ways to do this.
send_file and then immediately delete (Linux only)
Flask has an after_this_request decorator which could work for this use case:
#app.route('/files/<filename>/download')
def download_file(filename):
file_path = derive_filepath_from_filename(filename)
file_handle = open(file_path, 'r')
#after_this_request
def remove_file(response):
try:
os.remove(file_path)
file_handle.close()
except Exception as error:
app.logger.error("Error removing or closing downloaded file handle", error)
return response
return send_file(file_handle)
The issue is that this will only work on Linux (which lets the file be read even after deletion if there is still an open file pointer to it). It also won't always work (I've heard reports that sometimes send_file won't wind up making the kernel call before the file is already unlinked by Flask). It doesn't tie up the Python process to send the file though.
Stream file, then delete
Ideally though you'd have the file cleaned up after you know the OS has streamed it to the client. You can do this by streaming the file back through Python by creating a generator that streams the file and then closes it, like is suggested in this answer:
def download_file(filename):
file_path = derive_filepath_from_filename(filename)
file_handle = open(file_path, 'r')
# This *replaces* the `remove_file` + #after_this_request code above
def stream_and_remove_file():
yield from file_handle
file_handle.close()
os.remove(file_path)
return current_app.response_class(
stream_and_remove_file(),
headers={'Content-Disposition': 'attachment', 'filename': filename}
)
This approach is nice because it is cross-platform. It isn't a silver bullet however, because it ties up the Python web process until the entire file has been streamed to the client.
Clean up on a timer
Run another process on a timer (using cron, perhaps) or use an in-process scheduler like APScheduler and clean up files that have been on-disk in the temporary location beyond your timeout (e. g. half an hour, one week, thirty days, after they've been marked "downloaded" in RDMBS)
This is the most robust way, but requires additional complexity (cron, in-process scheduler, work queue, etc.)
You can also store the file's data in memory, delete it, then serve what you have in memory.
For example, if you were serving a PDF:
import io
import os
#app.route('/download')
def download_file():
file_path = get_path_to_your_file()
return_data = io.BytesIO()
with open(file_path, 'rb') as fo:
return_data.write(fo.read())
# (after writing, cursor will be at last byte, so move it to start)
return_data.seek(0)
os.remove(file_path)
return send_file(return_data, mimetype='application/pdf',
attachment_filename='download_filename.pdf')
(above I'm just assuming it's PDF, but you can get the mimetype programmatically if you need)
Flask has an after_request decorator which could work in this case:
#app.route('/', methods=['POST'])
def upload_file():
uploaded_file = request.files['file']
file = secure_filename(uploaded_file.filename)
#app.after_request
def delete(response):
os.remove(file_path)
return response
return send_file(file_path, as_attachment=True, environ=request.environ)
Based on #Garrett comment, the better approach is to not blocking the send_file while removing the file. IMHO, the better approach is to remove it in the background, something like the following is better:
import io
import os
from flask import send_file
from multiprocessing import Process
#app.route('/download')
def download_file():
file_path = get_path_to_your_file()
return_data = io.BytesIO()
with open(file_path, 'rb') as fo:
return_data.write(fo.read())
return_data.seek(0)
background_remove(file_path)
return send_file(return_data, mimetype='application/pdf',
attachment_filename='download_filename.pdf')
def background_remove(path):
task = Process(target=rm(path))
task.start()
def rm(path):
os.remove(path)

Categories

Resources