I'm working on a new library which will allow the user to parse any file (xlsx, csv, json, tar, zip, txt) into generators.
Now I'm stuck on the zip archive: when I try to parse a csv from it, I get
io.UnsupportedOperation: seek immediately after elem.seek(0). The csv file is a simple one, 4 rows by 4 columns. If I parse the csv using the csv parser directly I get what I want, but trying to parse it from a zip archive... boom. Error!
with open("/Users/ro/Downloads/archive_file/csv.zip", 'r') as my_file_file:
    asd = parse_zip(my_file_file)
    print asd
Where parse_zip is:
def parse_zip(element):
    """Function for manipulating zip files"""
    try:
        my_zip = zipfile.ZipFile(element, 'r')
    except zipfile.BadZipfile:
        raise err.NestedArchives(element)
    else:
        my_file = my_zip.open('corect_csv.csv')
        # print my_file
        my_mime = csv_tsv_parser.parse_csv_tsv(my_file)
        print list(my_mime)
And parse_csv_tsv is:
def _csv_tsv_parser(element):
    """Helper function for csv and tsv files that returns a generator"""
    for row in element:
        if any(s for s in row):
            yield row

def parse_csv_tsv(elem):
    """Function for manipulating all the csv files"""
    dialect = csv.Sniffer().sniff(elem.readline())
    elem.seek(0)
    data_file = csv.reader(elem, dialect)
    read_data = _csv_tsv_parser(data_file)
    yield '', read_data
Where am I wrong? Is the way I'm opening the file OK, or...?
ZipFile.open returns a file-like ZipExtFile object that inherits from io.BufferedIOBase. ZipExtFile did not implement seek before Python 3.7, so calling seek(0) on it raises io.UnsupportedOperation, hence the exception.
However, ZipExtFile does provide a peek method, which will return a number of bytes without moving the file pointer. So changing
dialect = csv.Sniffer().sniff(elem.readline())
elem.seek(0)
to
num_bytes = 128 # number of bytes to read
dialect = csv.Sniffer().sniff(elem.peek(n=num_bytes))
solves the problem.
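Putting the pieces together, here is a minimal, self-contained sketch of the peek-based fix (using Python 3; the archive, member name, and contents are made up for the demonstration, and io.TextIOWrapper is used to get text rows out of the binary member):

```python
import csv
import io
import zipfile

# Build a small zip containing a CSV purely in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("correct_csv.csv", "a,b,c\n1,2,3\n4,5,6\n")

with zipfile.ZipFile(buf) as zf:
    member = zf.open("correct_csv.csv")
    # peek() returns buffered bytes without advancing the file position,
    # so no seek(0) is needed afterwards
    sample = member.peek(128).decode()
    dialect = csv.Sniffer().sniff(sample)
    # wrap the binary stream so csv.reader receives text lines
    reader = csv.reader(io.TextIOWrapper(member), dialect)
    rows = list(reader)

print(rows)
```

Because peek does not move the file pointer, the subsequent reader still starts from the first byte of the member.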
Related
Here is my scenario: I have a zip file that I am downloading with requests into memory rather than writing a file. I am unzipping the data into an object called myzipfile. Inside the zip file is a csv file. I would like to convert each row of the csv data into a dictionary. Here is what I have so far.
import csv
import zipfile
from io import BytesIO
import requests
# other imports etc.

r = requests.get(url=fileurl, headers=headers, stream=True)
filebytes = BytesIO(r.content)
myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    mycsv = myzipfile.open(name).read()
    for row in csv.DictReader(mycsv):  # it fails here.
        print(row)
errors:
Traceback (most recent call last):
  File "/usr/lib64/python3.7/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
_csv.Error: iterator should return strings, not int (did you open the file in text mode?)
Looks like csv.DictReader(mycsv) expects a file object instead of raw data. How do I convert the rows in the mycsv object data (<class 'bytes'>) to a list of dictionaries? I'm trying to accomplish this without writing a file to disk and working directly from csv objects in memory.
The DictReader expects a file or file-like object: we can satisfy this expectation by loading the zipped file into an io.StringIO instance.
Note that StringIO expects its argument to be a str, but reading a file from the zipfile returns bytes, so the data must be decoded. This example assumes that the csv was originally encoded with the local system's default encoding. If that is not the case, the correct encoding must be passed to decode().
import io

for name in myzipfile.namelist():
    data = myzipfile.open(name).read().decode()
    mycsv = io.StringIO(data)
    reader = csv.DictReader(mycsv)
    for row in reader:
        print(row)
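An alternative that avoids reading the whole member into one string is to wrap the binary stream in io.TextIOWrapper, which decodes on the fly. A minimal sketch (the member name and data here are made up to stand in for the downloaded archive):

```python
import csv
import io
import zipfile

# Build a zip with one CSV member in memory, standing in for r.content
filebytes = io.BytesIO()
with zipfile.ZipFile(filebytes, "w") as zf:
    zf.writestr("data.csv", "id,name\n1,alice\n2,bob\n")

myzipfile = zipfile.ZipFile(filebytes)
for name in myzipfile.namelist():
    with myzipfile.open(name) as member:
        # TextIOWrapper decodes the binary stream lazily, so the whole
        # member never has to be held in memory as a single str
        reader = csv.DictReader(io.TextIOWrapper(member, encoding="utf-8"))
        rows = list(reader)

print(rows)
```

The encoding passed to TextIOWrapper plays the same role as the decode() call above and should match how the csv was originally written.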
dict_list = []  # a list
reader = csv.DictReader(open('yourfile.csv', 'rb'))
for line in reader:  # since we used DictReader, each line will be saved as a dictionary
    dict_list.append(line)
The scenario is: I need to convert a dictionary object to JSON and write it to a file. A new dictionary object is sent on every write_to_file() method call, and I have to append the JSON to the file. Following is the code:
def write_to_file(self, dict=None):
    f = open("/Users/xyz/Desktop/file.json", "w+")
    if json.load(f) != None:
        data = json.load(f)
        data.update(dict)
        f = open("/Users/xyz/Desktop/file.json", "w+")
        f.write(json.dumps(data))
    else:
        f = open("/Users/xyz/Desktop/file.json", "w+")
        f.write(json.dumps(dict))
I'm getting the error "No JSON object could be decoded" and the JSON is not written to the file. Can anyone help?
This looks overcomplex and highly buggy. Opening the file several times in w+ mode and reading it twice won't get you anywhere: w+ truncates the file to empty, and json can't parse an empty file.
I would test whether the file exists; if so, read it (else start from an empty dict).
The default None argument makes no sense: you have to pass a dictionary or the update method won't work. That said, we can skip the update if the object is "falsy".
Don't use dict as a variable name.
In the end, overwrite the file with a new version of your data (w+ and r+ should be reserved for fixed-size/binary files, not text/json/xml files).
Like this:
import json
import os

def write_to_file(self, new_data=None):
    # define filename once to avoid copy/paste
    filename = "/Users/xyz/Desktop/file.json"
    data = {}  # in case the file doesn't exist yet
    if os.path.exists(filename):
        with open(filename) as f:
            data = json.load(f)
    # update data with new_data if non-None/empty
    if new_data:
        data.update(new_data)
    # write the updated dictionary, creating the file if it didn't exist
    with open(filename, "w") as f:
        json.dump(data, f)
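To see how successive calls merge into a single JSON object, here is a standalone sketch of the same read-update-write logic (the class context is dropped and a temporary path replaces the hard-coded one, purely for demonstration):

```python
import json
import os
import tempfile

def write_to_file(filename, new_data=None):
    # read existing data if the file exists, merge, then overwrite
    data = {}
    if os.path.exists(filename):
        with open(filename) as f:
            data = json.load(f)
    if new_data:
        data.update(new_data)
    with open(filename, "w") as f:
        json.dump(data, f)

path = os.path.join(tempfile.mkdtemp(), "file.json")
write_to_file(path, {"a": 1})   # creates the file
write_to_file(path, {"b": 2})   # merges into the existing object
with open(path) as f:
    merged = json.load(f)
print(merged)
```

Each call leaves the file containing one valid JSON object, which is what makes the next json.load succeed.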
For Reference
I have a python class which is supposed to unpack an archive and recursively iterate over the directory structure and then return the files for further processing. In my case I want to hash those files. I'm struggling with returning the files. Here is my take.
I created an unzip function and a function which creates a log-file with the paths of all the files that were unpacked. Then I want to access this log-file and return ALL of the files so I can use them in another Python class for further processing. This doesn't seem to work yet.
Structure of log-file:
/home/usr/Downloads/outdir/XXX.log
/home/usr/Downloads/outdir/Code/XXX.py
/home/usr/Downloads/outdir/Code/XXX.py
/home/usr/Downloads/outdir/Code/XXX.py
Code of interest:
@staticmethod
def read_received_files(from_log):
    with open(from_log, 'r') as data:
        data = data.readlines()
        for lines in data:
            # This does not seem to work yet
            read_files = open(lines.strip())
            return read_files
I believe that's what you're looking for:
@staticmethod
def read_received_files(from_log):
    files = []
    with open(from_log, 'r') as data:
        for line in data:
            files.append(open(line.strip()))
    return files
You returned while iterating, which prevented the remaining files from being opened.
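If you don't need all the files open at once, a generator variant of the same idea yields them one at a time (sketched here as a plain function, dropping the class context for brevity):

```python
def read_received_files(from_log):
    # Generator variant: yields one open file at a time instead of
    # building a list, so each file can be processed and closed in turn
    with open(from_log, "r") as log:
        for line in log:
            path = line.strip()
            if path:  # skip blank lines in the log
                yield open(path)
```

Unlike the early return in the question, yield suspends the loop instead of ending it, so iteration resumes with the next path on the next request.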
Since you are primarily after the metadata and hash of the files stored in the zip file, but not the files themselves, there is no need to extract the files to the file system.
Instead you can use the ZipFile.open() method to access the contents of each file through a file-like object. Metadata can be gathered from the ZipInfo object for each file. Here's an example which collects the file name and file size as metadata, plus the hash of the file.
import hashlib
import zipfile
from collections import namedtuple

def get_files(archive):
    FileInfo = namedtuple('FileInfo', ('filename', 'size', 'hash'))
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if not info.filename.endswith('/'):  # exclude directories
                f = zf.open(info)
                hash_ = hashlib.md5(f.read()).hexdigest()
                yield FileInfo(info.filename, info.file_size, hash_)

for f in get_files('some_file.zip'):
    print('{}: {} {} bytes'.format(f.hash, f.filename, f.size))
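One caveat with f.read() above is that it loads each member fully into memory. For large members, the hash can be fed incrementally; a sketch of that variation (the member name, contents, and chunk size here are arbitrary, chosen only for the demonstration):

```python
import hashlib
import io
import zipfile

def hash_member(zf, info, chunk_size=65536):
    # Hash a zip member in chunks so large files are never fully
    # loaded into memory at once
    md5 = hashlib.md5()
    with zf.open(info) as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Demo on a small in-memory archive
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "hello")
with zipfile.ZipFile(buf) as zf:
    digests = {info.filename: hash_member(zf, info) for info in zf.infolist()}
print(digests)
```

The iter(callable, sentinel) form keeps reading until ZipExtFile.read returns the empty bytes object that marks end-of-file.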
I am writing some Python code that loops through a number of files and processes the first few hundred lines of each file. I would like to extend this code so that if any of the files in the list are compressed, it will automatically decompress while reading them, so that my code always receives the decompressed lines. Essentially my code currently looks like:
for f in files:
    handle = open(f)
    process_file_contents(handle)
Is there any function that can replace open in the above code so that if f is either plain text or gzip-compressed text (or bzip2, etc.), the function will always return a file handle to the decompressed contents of the file? (No seeking required, just sequential access.)
I had the same problem: I'd like my code to accept filenames and return a filehandle to be used with a with statement, handling decompression automatically, etc.
In my case, I'm willing to trust the filename extensions and I only need to deal with gzip and maybe bzip files.
import gzip
import bz2

def open_by_suffix(filename):
    if filename.endswith('.gz'):
        return gzip.open(filename, 'rb')
    elif filename.endswith('.bz2'):
        return bz2.BZ2File(filename, 'r')
    else:
        return open(filename, 'r')
If we don't trust the filenames, we can compare the initial bytes of the file for magic strings (modified from https://stackoverflow.com/a/13044946/117714):
import gzip
import bz2

magic_dict = {
    "\x1f\x8b\x08": (gzip.open, 'rb'),
    "\x42\x5a\x68": (bz2.BZ2File, 'r'),
}

max_len = max(len(x) for x in magic_dict)

def open_by_magic(filename):
    with open(filename) as f:
        file_start = f.read(max_len)
    for magic, (fn, flag) in magic_dict.items():
        if file_start.startswith(magic):
            return fn(filename, flag)
    return open(filename, 'r')
Usage:
# cat
for filename in filenames:
    with open_by_suffix(filename) as f:
        for line in f:
            print line
Your use-case would look like:
for f in files:
    with open_by_suffix(f) as handle:
        process_file_contents(handle)
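The snippets above are Python 2 (print statements, magic numbers as str). On Python 3 the magic prefixes must be bytes and the probe read must be done in binary mode; a sketch of the same idea under those assumptions:

```python
import bz2
import gzip

# Python 3 version: magic prefixes are bytes objects
magic_dict = {
    b"\x1f\x8b\x08": gzip.open,
    b"\x42\x5a\x68": bz2.open,
}
max_len = max(len(m) for m in magic_dict)

def open_by_magic(filename):
    # Probe the first bytes in binary mode, then reopen with the
    # matching opener (or plain open if no magic matches)
    with open(filename, "rb") as f:
        file_start = f.read(max_len)
    for magic, opener in magic_dict.items():
        if file_start.startswith(magic):
            # 'rt' yields decoded text lines, matching the plain-file case
            return opener(filename, "rt")
    return open(filename, "r")
```

Both gzip.open and bz2.open accept text-mode flags in Python 3, so callers always iterate over str lines regardless of compression.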
The computer is toying with me, I know it!
I am creating a zip folder in Python. The individual files are generated in memory and then the whole thing is zipped and saved to a file. I am allowed to add 9 files to the zip. I am allowed to add 11 files to the zip. But 10, no, not 10 files. The zip file IS saved to my computer, but I'm not allowed to open it; Windows says that the compressed zipped folder is invalid.
I use the code below, which I got from another stackoverflow question. It appends 10 files and saves the zipped folder. When I click on the folder, I cannot extract it. BUT, remove one of the append() calls and it's fine. Or add another append and it works!
What am I missing here? How can I make this work every time?
imz = InMemoryZip()
imz.append("1a.txt", "a").append("2a.txt", "a").append("3a.txt", "a") \
   .append("4a.txt", "a").append("5a.txt", "a").append("6a.txt", "a") \
   .append("7a.txt", "a").append("8a.txt", "a").append("9a.txt", "a") \
   .append("10a.txt", "a")
imz.writetofile("C:/path/test.zip")
import zipfile
import StringIO

class InMemoryZip(object):
    def __init__(self):
        # Create the in-memory file-like object
        self.in_memory_zip = StringIO.StringIO()

    def append(self, filename_in_zip, file_contents):
        '''Appends a file with name filename_in_zip and contents of
        file_contents to the in-memory zip.'''
        # Get a handle to the in-memory zip in append mode
        zf = zipfile.ZipFile(self.in_memory_zip, "a", zipfile.ZIP_DEFLATED, False)
        # Write the file to the in-memory zip
        zf.writestr(filename_in_zip, file_contents)
        # Mark the files as having been created on Windows so that
        # Unix permissions are not inferred as 0000
        for zfile in zf.filelist:
            zfile.create_system = 0
        return self

    def read(self):
        '''Returns a string with the contents of the in-memory zip.'''
        self.in_memory_zip.seek(0)
        return self.in_memory_zip.read()

    def writetofile(self, filename):
        '''Writes the in-memory zip to a file.'''
        f = file(filename, "w")
        f.write(self.read())
        f.close()
You should use the 'wb' mode when creating the file you are saving to the file system. This ensures that the file is written in binary.
Otherwise, whenever a newline (\n) byte happens to occur in the zip data, Python will replace it with the Windows line ending (\r\n), corrupting the archive. The reason 10 files is a problem is that 10 happens to be the ASCII code for \n, so the entry count stored in the archive's end-of-central-directory record introduces exactly that byte.
So your write function should look like this:
def writetofile(self, filename):
    '''Writes the in-memory zip to a file.'''
    f = file(filename, 'wb')
    f.write(self.read())
    f.close()
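The "10 files" coincidence can be observed directly: an archive whose end-of-central-directory record counts 10 entries necessarily contains a 0x0A byte, the very byte that text mode would rewrite on Windows. A small sketch (using Python 3 and io.BytesIO in place of StringIO; the file names mirror the question's):

```python
import io
import zipfile

# Build a 10-entry archive entirely in memory
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    for i in range(1, 11):
        zf.writestr("%da.txt" % i, "a")
data = buf.getvalue()

# The entry-count field of the end-of-central-directory record holds 10,
# i.e. the byte 0x0A -- a literal \n inside the raw archive bytes
print(b"\n" in data)  # -> True
```

Writing those bytes through a text-mode handle on Windows would turn that \n into \r\n and shift every offset after it, which is exactly why the extractor rejects the file.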
This should fix your problem and work for the files in your example. In your case, though, you might find it easier to write the zip file directly to the file system, as in this code, which incorporates some of the comments from above:
import StringIO
import zipfile

class ZipCreator:
    buffer = None

    def __init__(self, fileName=None):
        if fileName:
            self.zipFile = zipfile.ZipFile(fileName, 'w', zipfile.ZIP_DEFLATED, False)
            return
        self.buffer = StringIO.StringIO()
        self.zipFile = zipfile.ZipFile(self.buffer, 'w', zipfile.ZIP_DEFLATED, False)

    def addToZipFromFileSystem(self, filePath, filenameInZip):
        self.zipFile.write(filePath, filenameInZip)

    def addToZipFromMemory(self, filenameInZip, fileContents):
        self.zipFile.writestr(filenameInZip, fileContents)
        for zipFile in self.zipFile.filelist:
            zipFile.create_system = 0

    def write(self, fileName):
        if not self.buffer:  # if the buffer was not initialized, the file is written by the ZipFile
            self.zipFile.close()
            return
        f = file(fileName, 'wb')
        f.write(self.buffer.getvalue())
        f.close()

# Use File Handle
zipCreator = ZipCreator('C:/path/test.zip')
# Use Memory Buffer
# zipCreator = ZipCreator()

for i in range(1, 11):  # the ten files from the question
    zipCreator.addToZipFromMemory('test/%sa.txt' % i, 'a')
zipCreator.write('C:/path/test.zip')
Ideally, you would probably use separate classes for an in-memory zip and a zip that is tied to the file system from the beginning. I have also seen some issues with the in-memory zip when folders are added, which are difficult to recreate and which I am still trying to track down.