Python file downloading fails

I'm new to Python and am attempting to download the NIST NVD JSON files. I have tried several methods, but each one only writes a file of about 324 bytes. If I download a single file by itself it does in fact work, but there are several files to download for this.
I tried adjusting the chunk_size, but I still can't get the 1 to 6 MB zip files to download.
from requests import get

def download(url, filename):
    response = get(url, stream=True)
    with open(filename, "wb") as file:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                file.write(chunk)
    print('Downloaded! ', filename)

with open('NVD_JSON_SOURCE_URLS.txt') as f:
    for line in f:
        filename = line.split('/')[-1]
        url = line
        download(url, filename)
The input works and it starts the downloads; it just never completes them. Clearly I am missing something frustratingly simple here, but after 2 days I am not getting any closer. Thanks.

I find chunking to be painful for instances like this in Python. This approach has worked for me frequently:
import requests
import shutil

def download_file(url, filename):
    r = requests.get(url, stream=True)
    with open(filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)
This streams the raw response straight to disk; shutil.copyfileobj copies it block by block rather than holding the whole file in memory, so files of a couple of MB (or much larger) should work fine.
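As a usage sketch, this is how download_file could be dropped into the question's loop over NVD_JSON_SOURCE_URLS.txt (note the strip() call; as the answers below explain, the trailing newline on each line is what breaks the download):

with open('NVD_JSON_SOURCE_URLS.txt') as f:
    for line in f:
        url = line.strip()                  # drop the trailing newline
        if not url:
            continue                        # skip blank lines
        filename = url.split('/')[-1]
        download_file(url, filename)
        print('Downloaded!', filename)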

I think line ends with whitespace characters (the trailing newline from the file), so if you remove them with strip(), the code should work.
for line in f:
    line = line.strip()
    ...
I tested it and it works for me.
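To see the problem directly, printing the repr of each line makes the hidden newline visible (a small diagnostic sketch against the same input file):

with open('NVD_JSON_SOURCE_URLS.txt') as f:
    for line in f:
        print(repr(line))   # every URL ends with '\n', so the server returns a tiny error page instead of the zip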

There is a line break at the end of each line when you read the data from the .txt file, so you should strip the line break first.
from requests import get

def download(url, filename):
    response = get(url, stream=True)
    with open(filename, "wb") as file:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                file.write(chunk)
    print('Downloaded! ', filename)

with open('NVD_JSON_SOURCE_URLS.txt') as f:
    for line in f:
        line = line.strip()
        filename = line.split('/')[-1]
        url = line
        download(url, filename)

Related

Download bz2, read compressed files in memory (avoid memory overflow)

As the title says, I'm downloading a bz2 file which has a folder with a lot of text files inside...
My first version decompressed everything in memory. Although the archive is only 90 MB, it holds 60 files of 750 MB each once uncompressed, so the computer obviously can't handle something like 40 GB of RAM.
So the problem is that the files are too big to keep in memory all at the same time. I'm currently using this code, which works but is far too slow:
response = requests.get('https://fooweb.com/barfile.bz2')

# Save file to disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

# Extract the files into a folder:
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# Process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder, filename)
    file = open(filepath, 'r').readlines()
    for line in file:
        some_processing(line)
Is there a way I could do this without dumping the archive to disk, decompressing and reading only one file from the .bz2 at a time?
Thank you very much for your time in advance; I hope somebody knows how to help me with this.
#!/usr/bin/python3
import sys
import requests
import tarfile

got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))
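Since the question hands each line to some_processing, the same streaming loop can read the members line by line without ever touching the disk. A sketch, assuming the archive members are UTF-8 text (some_processing stands in for the question's own function):

import sys
import requests
import tarfile

def some_processing(line):
    print(line, end='')                      # placeholder for the real per-line work

got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            member = tar.extractfile(info)
            for raw_line in member:          # lines come out as bytes
                some_processing(raw_line.decode('utf-8'))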
I did it this way:
import io
import requests
import tarfile

response = requests.get(my_url_to_file)
memfile = io.BytesIO(response.content)

# We extract files in memory, one by one:
filecount = 0
with tarfile.open(fileobj=memfile, mode="r:bz2") as tar:
    for member_name in tar.getnames():
        member = tar.extractfile(member_name)
        if member is None:          # skip directories
            continue
        filecount += 1
        for line in member:         # member is already a file-like object, no open() needed
            process_line(line)

How to download a file using requests

I am using the requests library to download a file from a URL. This is my code:
for tag in soup.find_all('a'):
    if '.zip' in str(tag):
        file_name = str(tag).strip().split('>')[-2].split('<')[0]
        link = link_name + tag.get('href')
        r = requests.get(link, stream=True)
        with open(os.path.join(download_path, file_name), 'wb') as fd:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:
                    fd.write(chunk)
And then I unzip the file using this code:
unzip_path = os.path.join(download_path, file_name.split('.')[0])
with zipfile.ZipFile(os.path.join(download_path, file_name), 'r') as zip_ref:
    zip_ref.extractall(unzip_path)
This code looks for a zip file on the provided page and then downloads the zipped file into a directory. Then it unzips the file using the zipfile library.
The problem with this code is that sometimes the download is not complete. So, for example, if the zipped file is 312 KB long, only part of it is downloaded, and then I get a BadZipFile error. But sometimes the entire file is downloaded correctly.
I tried the same without streaming, and even that results in the same problem.
How do I check whether all the chunks are downloaded properly?
Maybe this works:
r = requests.get(link)
with open(os.path.join(download_path, file_name), 'wb') as fd:
    fd.write(r.content)
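The question also asks how to tell whether all the chunks arrived. One hedged way with requests (assuming the server sends a Content-Length header; link, download_path and file_name are the names used above) is to count the bytes written and compare them against that header, raising on HTTP errors as well:

import os
import requests

r = requests.get(link, stream=True)
r.raise_for_status()                          # fail loudly on 4xx/5xx responses

expected = int(r.headers.get('Content-Length', 0))
written = 0
with open(os.path.join(download_path, file_name), 'wb') as fd:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            written += fd.write(chunk)        # write() returns the number of bytes written

if expected and written != expected:
    raise IOError('Incomplete download: got %d of %d bytes' % (written, expected))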

Python code to move file over socket

SO family, I'm trying to write an application where I can transfer files between two computers. I currently have this working using something like this:
On client side
file = open(srcfile, 'r')
content = file.read()
file.close()
send_message(srcfile)
send_message(content)
On Server side:
filename = receive_message(message)
content = receive_message(message)
file = open(filename, 'w')
file.write(content)
file.close()
This seems to work for text files, but for other file types it doesn't work.
I'm thinking there has to be a better way. Any suggestions?
You need to use
file = open(srcfile, 'rb')
on the client side and
file = open(filename, 'wb')
on the server side, respectively... the b means binary.
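For reference, here is a minimal sketch of moving one file over a plain TCP socket in binary mode. The question's send_message/receive_message helpers are not shown, so this uses the standard socket module directly, with a hypothetical host/port and a simple framing of name length, name, file size, then the raw bytes:

import os
import socket
import struct

HOST, PORT = '127.0.0.1', 5001               # hypothetical address

def send_file(path, host=HOST, port=PORT):
    name = os.path.basename(path).encode('utf-8')
    size = os.path.getsize(path)
    with socket.create_connection((host, port)) as sock, open(path, 'rb') as f:
        # header: 4-byte name length, name, 8-byte file size
        sock.sendall(struct.pack('!I', len(name)) + name + struct.pack('!Q', size))
        while True:
            chunk = f.read(64 * 1024)        # stream the file in binary chunks
            if not chunk:
                break
            sock.sendall(chunk)

def recv_exactly(sock, n):
    buf = b''
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError('socket closed early')
        buf += part
    return buf

def receive_file(conn):
    # conn is the accepted connection on the server side
    name_len = struct.unpack('!I', recv_exactly(conn, 4))[0]
    name = recv_exactly(conn, name_len).decode('utf-8')
    size = struct.unpack('!Q', recv_exactly(conn, 8))[0]
    with open(name, 'wb') as out:
        remaining = size
        while remaining:
            chunk = conn.recv(min(64 * 1024, remaining))
            if not chunk:
                raise ConnectionError('socket closed early')
            out.write(chunk)
            remaining -= len(chunk)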

gzip a file quicker using Python?

I am attempting to gzip files faster using Python, as some of my files are as small as 30 MB and as large as 4 GB.
Is there a more efficient way of creating a gzip file than the following? Is there a way to optimize it so that, if a file is small enough to fit in memory, it simply reads the whole file at once rather than doing it on a per-line basis?
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        f_out.writelines(f_in)
Copy the file in bigger chunks using the shutil.copyfileobj() function. In this example, I'm using 16 MiB blocks, which is pretty reasonable.
import gzip
import shutil

MEG = 2**20
with open(j, 'rb') as f_in:
    with gzip.open(j + ".gz", 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out, length=16*MEG)
You may find that calling out to gzip is faster for large files, especially if you plan to zip multiple files in parallel.
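A sketch of that suggestion, assuming a Unix-like system with the gzip command on PATH (the -k flag, which keeps the original file, needs a reasonably recent GNU gzip; list_of_files is a hypothetical list of paths):

import subprocess
from concurrent.futures import ThreadPoolExecutor

def gzip_external(path):
    # gzip -k keeps the original file and writes path + '.gz' next to it
    subprocess.check_call(['gzip', '-k', path])

# compress several files in parallel; each gzip runs in its own child process
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(gzip_external, list_of_files))   # list() forces completion and surfaces errors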
Instead of reading it line by line, you can read it at once.
Example:
import gzip

with open(j, 'rb') as f_in:
    content = f_in.read()

f = gzip.open(j + '.gz', 'wb')
f.write(content)
f.close()
Find two almost identical methods for reading gzip files below:
A.) Load everything into memory --> can be a bad choice for very big files (several GB), because you can run out of memory
B.) Don't load everything into memory, read line by line --> good for BIG files
adapted from
https://codebright.wordpress.com/2011/03/25/139/
and
https://www.reddit.com/r/Python/comments/2olhrf/fast_gzip_in_python/
http://pastebin.com/dcEJRs1i
import sys
import subprocess

if sys.version.startswith("3"):
    import io
    io_method = io.BytesIO
else:
    import cStringIO
    io_method = cStringIO.StringIO
A.)
def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    fh = io_method(ph.communicate()[0])
    for line in fh:
        yield line
B.)
def yield_line_gz_file(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    ph = subprocess.Popen(["gzcat", fn], stdout=subprocess.PIPE)
    for line in ph.stdout:
        yield line
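For comparison, the standard library can do the same line-by-line reading without shelling out to gzcat; a sketch for Python 3 (whether it is faster than the subprocess approach depends on the interpreter version and the file):

import gzip

def yield_line_gz_file_stdlib(fn):
    """
    :param fn: String (absolute path)
    :return: GeneratorFunction (yields String)
    """
    with gzip.open(fn, 'rt') as fh:   # 'rt' yields decoded text lines
        for line in fh:
            yield line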

How do I automatically handle decompression when reading a file in Python?

I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:
for f in files:
    handle = open(f)
    process_file_contents(handle)
Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)
I had the same problem: I'd like my code to accept filenames and return a filehandle for use in a with statement, with decompression handled automatically, etc.
In my case, I'm willing to trust the filename extensions and I only need to deal with gzip and maybe bzip files.
import gzip
import bz2

def open_by_suffix(filename):
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rb')
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'r')
    else:
        return open(filename, 'r')
If we don't trust the filenames, we can compare the initial bytes of the file for magic strings (modified from https://stackoverflow.com/a/13044946/117714):
import gzip
import bz2

magic_dict = {
    b"\x1f\x8b\x08": (gzip.open, 'rb'),
    b"\x42\x5a\x68": (bz2.BZ2File, 'r'),
}
max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    # read the first few bytes in binary mode and compare against the magic numbers
    with open(filename, 'rb') as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    return open(filename, 'r')
Usage:
# behaves like cat: print every line of each file
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print(line)
Your use-case would look like:
for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)
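If process_file_contents expects text rather than bytes, a hedged refinement (assuming Python 3 and UTF-8 content) is to open every branch in text mode so the caller always receives decoded lines:

import gzip
import bz2

def open_text_by_suffix(filename, encoding='utf-8'):
    # same idea as open_by_suffix, but every branch returns str lines
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rt', encoding=encoding)
    elif filename.endswith('.bz2'):
        return bz2.open(filename, 'rt', encoding=encoding)
    return open(filename, 'r', encoding=encoding)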
