Large Zip Files with the Zipfile Module in Python

I have never used the zipfile module before. I have a directory that contains thousands of zip files I need to process. These files can be up to 6 GB. I have looked through some documentation, but much of it is not clear on the best methods for reading large zip files without extracting them.
I stumbled upon this: Read a large zipped text file line by line in python
So in my solution I tried to emulate it and use it like I would read a normal text file with the with open function:
with open(odfslogp_obj, 'rb', buffering=102400) as odfslog
So I wrote the following based off the answer from that link:
for odfslogp_obj in odfslogs_plist:
    with zipfile.ZipFile(odfslogp_obj, mode='r') as z:
        with z.open(buffering=102400) as f:
            for line in f:
                print(line)
But this gives me an "unexpected keyword" error for z.open()
My question is: is there documentation that explains which keyword arguments the z.open() function takes? I only found documentation for the ZipFile() constructor.
I want to make sure my code isn't using up too much memory while processing these files line by line.
odfslogp_obj is a Path object btw
When I take off the buffering and just have z.open(), I get an error saying: TypeError: open() missing 1 required positional argument: 'name'

Once you've opened the zip file, you still need to open the individual files it contains. That's the second z.open you had problems with. It's not the built-in Python open and it doesn't have a "buffering" parameter. See ZipFile.open.
Once the zip file is opened you can enumerate its files and open them in turn. ZipFile.open opens in binary mode, which may be a different problem, depending on what you want to do with the file.
for odfslogp_obj in odfslogs_plist:
    with zipfile.ZipFile(odfslogp_obj, mode='r') as z:
        for name in z.namelist():
            with z.open(name) as f:
                for line in f:
                    print(line)
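If the binary mode is a problem, one option is to wrap each archive member in io.TextIOWrapper so the loop yields decoded text lines instead of bytes. A minimal sketch, assuming the members are UTF-8 text logs:

import io
import zipfile

for odfslogp_obj in odfslogs_plist:
    with zipfile.ZipFile(odfslogp_obj, mode='r') as z:
        for name in z.namelist():
            # TextIOWrapper decodes the binary stream lazily, so even
            # very large members are still read incrementally.
            with io.TextIOWrapper(z.open(name), encoding='utf-8') as f:
                for line in f:
                    print(line)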

Related

Dill deletes object when using "load"

I'm having an error that is driving me nuts. I generate some numerical simulation data sim_data.dill and save it to a directory on my computer using
with open(os.path.join(original_directory, 'sim_data.dill'), 'w') as f:
    dill.dump(outputs, f)
This data is about 1 GB and takes a while to generate. Now, I copied that file from original_directory to new_directory. When I try to load it from a different program using
simfile = '/new_directory/sim_data.dill'
with open(simfile, 'r') as f:
    outputs = dill.load(f)
One of two things happens:
the program says the file is missing with UnpicklingError: [Errno 2] No such file or directory: .../original_directory/sim_data.dill. This means dill puts the original_directory in the metadata of the file and refuses to open it when the file is moved; truly appalling behavior.
when I copy the file back to original_directory, trying to open it gives an EOFError and dill changes the file to zero bytes, essentially deleting it. This is even worse.
I can read the file just fine using a standard with open(simfile, 'r') as f: print f.readlines(), but obviously this does not help when trying to recover the internal class structure of the files.
Apparently this is normal behavior for dill; please see:
https://github.com/uqfoundation/dill/issues/296
Paraphrasing: the file location is part of the file handle to be pickled, and so unpickling it without that information is impossible. This means, apparently, that if you save a .dill file in one location, move the file manually (for example to a more convenient directory), and then try to open it again, it won't work.
In terms of the deletion issue, the author of the post above recommends using fmode=FMODE_PRESERVEDATA or one of the other file modes listed at
https://github.com/matsjoyce/dill/blob/087c00899ef55f31d36e7aee51a958b17daf8c91/dill/dill.py#L136-L145
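As a side note, dill (like pickle) writes a byte stream, so it is safest to open the file in binary mode ('wb'/'rb'); in text mode the stream can be rejected or silently corrupted. A minimal round-trip sketch, with a placeholder dict standing in for the real simulation outputs:

import dill

outputs = {"result": [1, 2, 3]}  # stand-in for the real simulation data

# Dump in binary mode; dill produces pickle bytes.
with open('sim_data.dill', 'wb') as f:
    dill.dump(outputs, f)

# The file can now be moved; load it in binary mode from its new path.
with open('sim_data.dill', 'rb') as f:
    outputs = dill.load(f)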

Python Subprocess for Notepad

I am trying to open Notepad using Popen and write something into it. I can't get my head around it. I can open Notepad using the command:
notepadprocess=subprocess.Popen('notepad.exe')
I am trying to figure out how I can write anything into the text file using Python. Any help is appreciated.
You can first write something into a text file (e.g. foo.txt) and then open it with Notepad:
import os
f = open('foo.txt','w')
f.write('Hello world!')
f.close()
os.system("notepad.exe foo.txt")
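Since the question already uses subprocess, the same idea can be written with a context manager and subprocess.run (Python 3.5+), which blocks until Notepad is closed:

import subprocess

with open('foo.txt', 'w') as f:
    f.write('Hello world!')

# Waits for the user to close Notepad before returning.
subprocess.run(['notepad.exe', 'foo.txt'])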
You may be confusing the concept of (text) file with the processes that manipulate them.
Notepad is a program, of which you can create a process. A file, on the other hand, is just a structure on your hard drive.
From a programming standpoint, Notepad doesn't edit files. It:
reads a file into computer memory
modifies the content of that memory
writes that memory back into a file (which could be the same file or a different one, the latter being the "Save as" operation).
Your program, just as any other program, can manipulate files, just as Notepad does. In particular, you can perform exactly the same sequence as Notepad:
my_file = "myfile.txt"         # the name/path of the file
with open(my_file, "r") as f:  # open the file for reading
    content = f.read()         # read the file into memory
content += "mytext"            # change the memory
with open(my_file, "w") as f:  # open the file for writing
    f.write(content)           # write the memory into the file
Found the exact solution from Alex K's comment. I used pywinauto to perform this task.
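For reference, a minimal pywinauto sketch of that approach, assuming the win32 backend and an English-locale Notepad window title:

from pywinauto.application import Application

# Start Notepad and attach to its main window (adjust the
# title regex for non-English locales).
app = Application(backend="win32").start("notepad.exe")
window = app.window(title_re=".*Notepad")

# Type into the edit control; with_spaces keeps spaces literal.
window.Edit.type_keys("Hello world!", with_spaces=True)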

Reading a .txt file in python

I have used the following code to read a .txt file:
f = os.open(os.path.join(self.dirname, self.filename), os.O_RDONLY)
And when I want to output the content I use this:
os.read(f, 10);
This reads only 10 bytes from the start of the file, whereas I need to read the entire content, however long it is, perhaps by using some value such as -1. What should I do?
You have two options:
Call os.read() repeatedly.
Open the file using the open() built-in (as opposed to os.open()), and just call f.read() with no arguments.
The second approach carries certain risk, in that you might run into memory issues if the file is very large.
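A short sketch of both options, using a placeholder path:

import os

path = "example.txt"  # hypothetical file

# Option 1: low-level os.read() in a loop; b"" signals end of file.
fd = os.open(path, os.O_RDONLY)
chunks = []
while True:
    chunk = os.read(fd, 4096)  # read up to 4096 bytes per call
    if not chunk:
        break
    chunks.append(chunk)
os.close(fd)
content = b"".join(chunks)

# Option 2: the built-in open(); read() with no size reads everything.
with open(path, "rb") as f:
    content = f.read()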

Opening And Reading Large Numbers of Files in Python

I have 37 data files that I need to open and analyze using python. Rather than brute force my code with a lot of open() and close() statements, is there a concise way to open and read from a large number of files?
You are going to have to open and close a file handle for each file you are hoping to read from. What is your aversion to doing it this way?
Are you looking for perhaps good way to determine which files need to be read?
Use a dictionary of filenames to file handles and then iterate over the items. Or a list of tuples. Or two-dimensional arrays. Or or or ...
Use the standard library fileinput module
Pass in the data files on the command line and process like this
import fileinput
for line in fileinput.input():
    process(line)
This iterates over all the lines of all the files passed in on the command line. This module also provides helper functions to let you know which file and line you are on currently.
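For example, fileinput.filename() and fileinput.filelineno() report where you currently are:

import fileinput

for line in fileinput.input():
    # filename() and filelineno() track the current file and line.
    print(fileinput.filename(), fileinput.filelineno(), line, end="")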
Use the arcane functionality known as a function.
def slurp(filename):
    """slurp will cleanly read in a file's contents, cleaning up after itself"""
    # Using the 'with' statement will automagically close
    # the file handle when you're done.
    with open(filename, "r") as fh:
        # if the files are too big to keep in-memory, then read by chunks
        # instead and process the data into smaller data structures as needed.
        return fh.read()

data = [slurp(filename) for filename in ["data1.dat", "data2.dat", "data3.dat"]]
You can also combine the entire thing:
for filename in ["a.dat", "b.dat", "c.dat"]:
    with open(filename, "r") as fh:
        for line in fh:
            process_line(line)
And so on...

File Reading Options Enquiry (Python)

I am a programming student for the semester. In class we have been learning about file opening, reading and writing.
We have used a_reader to achieve such tasks for file opening. I have been reading our associated texts and I have noticed that there is a CSV reader option, which I have been using.
I wanted to know if there were anymore possible ways to open/read a file as I am trying to grow my knowledge base in python and its associated contents.
EDIT:
I was referring to CSV more specifically, as that is the type of file we use at the moment. We have learnt about the CSV reader and a_reader; an example from one of our lectures is shown below.
def main():
    a_reader = open('IDCJAC0016_009225_1800_Data.csv', 'rU')
    file_data = a_reader.read()
    a_reader.close()
    print file_data

main()
It may seem overly broad, but I have little background knowledge, which is why I am asking: are there more ways than just the two above? If there are, can someone provide the types so I can read up on and research them?
If you're asking about places to store things, the first interfaces you'll meet are files and sockets (pretend a network connection is like a file, see http://docs.python.org/2/library/socket.html).
If you mean file formats (like csv), there are many! Probably you can think of many yourself, but besides csv there are html files, pictures (png, jpg, gif), archive formats (tar, zip), text files (.txt!), python files (.py). The list goes on.
There are many different ways to read files.
Just plain open will take a filename and open it as a sequence of lines. Or, you can just call read() on it, and it will read the whole file at once into one giant string.
codecs.open will take a filename and a character set, and decode each line to Unicode automatically. Or, again, you can just call read() on it, and it will read and decode the whole file at once into one giant Unicode string.
csv.reader will take a file or file-like object, and read it as a sequence of CSV rows. There's no direct equivalent of read()—but you can turn any sequence into a list by just calling list on it, so list(my_reader) will give you a list of rows (each of which is, itself, a list).
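For instance, a small Python 3 sketch of csv.reader, reusing the CSV file named in the question:

import csv

# Each row comes back as a list of strings.
with open('IDCJAC0016_009225_1800_Data.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows[0])  # the first row, typically the header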
zipfile.ZipFile will take a filename, or a file or file-like object, and read it as a ZIP archive. This doesn't go line by line, of course, but you can go archived file by archived file. Or you can do fancier things, like search for archived files by name.
There are modules for reading JSON and XML documents, different ways of handling binary files, and so on. Some of them work differently—for example, you can search an XML document as a tree with one module, or go element by element with a different one.
Python has a pretty extensive standard library, and you can find the documentation online. Every module that seems like it should be able to work on files, probably can.
And, beyond what comes in the standard library, PyPI, the Python Package Index has thousands of additional modules. Looking for a way to read YAML documents? Search PyPI for yaml and you'll find it.
Finally, Python makes it very easy to add things like this on your own. The skeleton of a function like csv.reader is as simple as this:
def reader(fileobj):
    for line in fileobj:
        yield parse_one_csv_line(line)
You can replace that parse_one_csv_line with anything you want, and you've got a custom reader. For example, here's an uppercase_reader:
def uppercase_reader(fileobj):
    for line in fileobj:
        yield line.upper()
In fact, you can even write the whole thing in one line:
shouts = (line.upper() for line in fileobj)
And the best thing is that, as long as your reader only yields one line at a time, your reader is itself a file-like object, so you can pass uppercase_reader(fileobj) to csv.reader and it works just fine.
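For example, a quick demonstration feeding the uppercase_reader defined above into csv.reader (io.StringIO stands in for a real open text file):

import csv
import io

fileobj = io.StringIO("a,b\nc,d\n")  # stands in for an open text file

for row in csv.reader(uppercase_reader(fileobj)):
    print(row)  # ['A', 'B'] then ['C', 'D']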
