Tornado: Read uploaded CSV file? - python

I want to do something like:
import csv
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def post(self):
        uploaded_csv_file = self.request.files['file'][0]
        with uploaded_csv_file as csv_file:
            for row in csv.reader(csv_file):
                self.write(' , '.join(row))
But, uploaded_csv_file is not of type file.
What is the best practice here?
Sources:
http://docs.python.org/2/library/csv.html
http://docs.python.org/2/library/functions.html#open
https://stackoverflow.com/a/11911972/242933

As the documentation explains:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable.
So, if you have something which is not a file, but is an iterator over lines, that's fine. If it's not even an iterator over lines, just wrap it in one. For a trivial example, if it's something with a read_all() method, you can always do this:
uploaded_csv_file = self.request.files['file'][0]
contents = uploaded_csv_file.read_all()
lines = contents.splitlines()
for row in csv.reader(lines):
    # ...
(Obviously you can merge steps together to make it briefer; I just wrote each step as a separate line to make it simpler to understand.)
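For what it's worth, a minimal sketch of the merged version in an actual Tornado handler might look like this; it assumes the uploaded item is Tornado's HTTPFile, whose body attribute holds the raw bytes, and that the CSV is UTF-8 text:

import csv
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def post(self):
        # HTTPFile carries filename, body and content_type; body is raw bytes.
        uploaded = self.request.files['file'][0]
        lines = uploaded.body.decode('utf-8').splitlines()
        for row in csv.reader(lines):
            self.write(' , '.join(row))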
Of course if the CSV files are large, and especially if they take a while to arrive and you've got a nice streaming interface, you probably don't want to read the whole thing at once this way. Most network server frameworks offer nice protocol adapters to, e.g., take a stream of bytes and give you a stream of lines. (For that matter, even socket.makefile() in the stdlib sort of does that…) But in your case, I don't think that's an issue.

Related

How to preprocess a text stream on the fly in Python?

What I need is a Python 3 function (or whatever) that would take a text stream (like sys.stdin or like that returned by open(file_name, "rt")) and return a text stream to be consumed by some other function but remove all the spaces, replace all tabs with commas and convert all the letters to lowercase on the fly (the "lazy" way) as the data is read by the consumer code.
I assume there is a reasonably easy way to do this in Python 3, something similar to list comprehensions, but so far I don't know exactly what it might be.
I am not sure this is what you mean, but the easiest way I can think of is to inherit from the file type returned by open() (io.TextIOWrapper in Python 3) and override the read method to do all the things you want after reading the data. A simple implementation would be:
import io

class MyFile(io.TextIOWrapper):
    # Use as: MyFile(open(file_name, 'rb'))
    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        # process the data: remove spaces, turn tabs into commas, lowercase
        return data.replace(' ', '').replace('\t', ',').lower()
I believe what you are looking for is the io module, more specifically io.StringIO.
You can then use open() to get the initial data, modify it, and pass it around:
import io

with open(file_name, 'rt') as f:
    stream = io.StringIO(f.read().replace(' ', '').replace('\t', ',').lower())
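If you want the transformation to happen lazily, line by line, as the consumer reads, one simple alternative (my sketch, not taken from the answers above; do_something is just a placeholder for the consumer) is a generator:

def preprocess(stream):
    # Transform each line only when the consumer asks for it.
    for line in stream:
        yield line.replace(' ', '').replace('\t', ',').lower()

# Usage: lines are read and transformed one at a time.
with open(file_name, 'rt') as f:
    for line in preprocess(f):
        do_something(line)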

how does pickle know which to pick?

I have my pickle function working properly
with open(self._prepared_data_location_scalar, 'wb') as output:
    # company1 = Company('banana', 40)
    pickle.dump(X_scaler, output, pickle.HIGHEST_PROTOCOL)
    pickle.dump(Y_scaler, output, pickle.HIGHEST_PROTOCOL)

with open(self._prepared_data_location_scalar, 'rb') as input_f:
    X_scaler = pickle.load(input_f)
    Y_scaler = pickle.load(input_f)
However, I am very curious: how does pickle know which one to load? Does it mean that everything has to be loaded in the same sequence?
What you have is fine. It's a documented feature of pickle:
It is possible to make multiple calls to the dump() method of the same Pickler instance. These must then be matched to the same number of calls to the load() method of the corresponding Unpickler instance.
There is no magic here, pickle is a really simple stack-based language that serializes python objects into bytestrings. The pickle format knows about object boundaries: by design, pickle.dumps('x') + pickle.dumps('y') is not the same bytestring as pickle.dumps('xy').
If you're interested in some background on the implementation, this article is an easy read that sheds some light on the Python pickler.
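A quick round-trip illustrating that behaviour (an in-memory buffer stands in for the file here):

import io
import pickle

buf = io.BytesIO()
pickle.dump('x', buf)
pickle.dump('y', buf)

buf.seek(0)
print(pickle.load(buf))  # 'x' -- objects come back in the order they were dumped
print(pickle.load(buf))  # 'y'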
Wow, I did not even know you could do this ... and I have been using Python for a very long time ... so that's totally awesome in my book. However, you really should not do this; it will be very hard to work with later (especially if it isn't you working on it).
I would recommend just doing
pickle.dump({"X":X_scalar,"Y":Y_scalar},output)
...
data = pickle.load(fp)
print "Y_scalar:",data['Y']
print "X_scalar:",data['X']
unless you have a very compelling reason to save and load the data the way you were in your question ...
Edit, to answer the actual question:
It loads from the start of the file to the end (i.e. it loads them in the same order they were dumped).
Yes, pickle picks objects in the order they were saved.
Intuitively, pickle appends to the end when it writes (dump) to a file,
and reads (load) the contents sequentially from the file.
Consequently, the order is preserved, allowing you to retrieve your data in the exact order you serialized it.

Python - Using an operator to assign a file name when using a class

I am a somewhat Python/programing newbie, and I am attempting to use a python class for the first time.
In this code I am trying to create a script to backup some files. I have 6 files in total that I want to back up regularly with this script so I thought that I would try and use the python Class to save me writing things out 6 times, and also to get practice using Classes.
In my code below I have things set up for just creating one instance of the class for now, to test things. However, I have hit a snag: I can't seem to use the % string-formatting operator to assign the original filename and the back-up filename.
Is it not possible to use the operator for a filename when opening a file? Or am I doing things wrong?
class Back_up(object):
    def __init__(self, file_name, back_up_file):
        self.file_name = file_name
        self.back_up_file = back_up_file
        print "I %s and me %s" % (self.file_name, self.back_up_file)
        with open('%s.txt', 'r') as f, open('{}.txt', 'w') as f2 % (self.file_name, self.back_up_file):
            f_read = read(f)
            f2.write(f_read)

first_back_up = Back_up("syn1_ready", "syn1_backup")
Also, line #7 is really long; any tips on how to shorten it are appreciated.
Thanks
Darren
If you just want your files backed up, may I suggest using shutil.copy()?
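For example, a minimal sketch using the file names from your question:

import shutil

# shutil.copy() handles the open/read/write loop for you.
shutil.copy("syn1_ready.txt", "syn1_backup.txt")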
As for your program:
If you want to substitute in a string to build a filename, you can do it. But your code doesn't do it.
You have this:
with open('%s.txt', 'r') as f, open('{}.txt', 'w') as f2 % (self.file_name, self.back_up_file):
Try this instead:
src = "%s.txt" % self.file_name
dest = "{}.txt".format(self.back_up_file)
with open(src, "rb") as f, open(dest, "wb") as f2:
# copying code goes here
The % operator operates on a string, and .format() is a method called on a string. Either way, you need to apply the operation to the string itself; you can't open two files in a with statement and then try to apply these operators at the end of that line.
You don't have to use explicit temp variables like I show here, but it's a good way to make the code easy to read, while greatly shortening the length of the with statements line.
Your code to copy the files will read all the file data into memory at one time. That will be fine for a small file. For a large file, you should use a loop that calls .read(CHUNK_SIZE) where CHUNK_SIZE is a maximum amount to read in a single chunk. That way if you ever back up a really large file on a computer with limited memory, it will simply work rather than filling the computer's memory and making the computer start swapping to disk.
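A rough sketch of that loop, reusing the src and dest names from above (CHUNK_SIZE here is just an arbitrary value chosen for illustration):

CHUNK_SIZE = 64 * 1024  # read at most 64 KiB at a time

with open(src, "rb") as f, open(dest, "wb") as f2:
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:        # empty bytes means end of file
            break
        f2.write(chunk)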
Try simplicity :)
Your line 7 is not going to parse. Split it using intermediate variables:
source_fname = "%s.txt" % self.file_name
target_fname = "%s.txt" % self.back_up_file
with open(source_fname) as source, open(target_fname) as target:
# do your thing
Also, try hard to avoid inconsistent and overly generic attribute names, like file_name, when you have two files to operate on.
Your copy routine is not going to be very efficient, either. It tries to read the entire file into memory, then write it. If I were you, I'd call rsync or something similar via popen() and feed it a proper list of files to operate on. Most probably I'd use bash for that, though Python may be fine, too.
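A rough sketch of that last suggestion (using subprocess rather than os.popen, assuming rsync is installed, and reusing the file names from the question):

import subprocess

# Let rsync do the copying; it only transfers what has changed.
subprocess.call(["rsync", "-a", "syn1_ready.txt", "syn1_backup.txt"])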

Read file in chunks - RAM-usage, reading strings from binary files

I'd like to understand the difference in RAM-usage of this methods when reading a large file in python.
Version 1, found here on stackoverflow:
def read_in_chunks(file_object, chunk_size=1024):
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open(file, 'rb')
for piece in read_in_chunks(f):
    process_data(piece)
f.close()
Version 2, I used this before I found the code above:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    process_data(piece)
f.close()
Both versions read the file in pieces, and each piece can be processed as it is read. In the second example, piece gets new content on every cycle, so I thought this would do the job without loading the complete file into memory.
But I don't really understand what yield does, and I'm pretty sure I got something wrong here. Could anyone explain that to me?
There is something else that puzzles me, besides of the method used:
The content of the piece I read is defined by the chunk-size, 1KB in the examples above. But... what if I need to look for strings in the file? Something like "ThisIsTheStringILikeToFind"?
Depending on where in the file the string occurs, it could be that one piece contains the part "ThisIsTheStr" - and the next piece would contain "ingILikeToFind". Using such a method it's not possible to detect the whole string in any piece.
Is there a way to read a file in chunks - but somehow care about such strings?
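As an aside (this is my sketch, not taken from the answers below, and the file name is made up): one common way to handle a string that may straddle a chunk boundary is to carry the last len(pattern) - 1 bytes of each chunk over into the next search:

def contains(file_object, pattern, chunk_size=1024):
    keep = max(len(pattern) - 1, 0)   # how much tail to carry over
    tail = b''
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            return False
        # Search the carried-over tail together with the new chunk, so a match
        # split across two chunks is still found.
        if pattern in tail + chunk:
            return True
        tail = (tail + chunk)[-keep:] if keep else b''

with open('big.bin', 'rb') as f:
    print(contains(f, b'ThisIsTheStringILikeToFind'))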
yield is the keyword in Python used to write generator functions. It means that the next time the function is called (or iterated on), execution will start back up at the exact point it left off the last time you called it. The two versions behave identically; the only difference is that the first one uses a tiny bit more call stack space than the second. However, the first one is far more reusable, so from a program design standpoint the first one is actually better.
EDIT: Also, one other difference is that the first one will stop reading once all the data has been read, the way it should, but the second one will only stop once either f.read() or process_data() throws an exception. In order to have the second one work properly, you need to modify it like so:
f = open(file, 'rb')
while True:
    piece = f.read(1024)
    if not piece:
        break
    process_data(piece)
f.close()
Starting from Python 3.8 you might also use an assignment expression (the walrus operator):
with open('file.name', 'rb') as file:
    while chunk := file.read(1024):
        process_data(chunk)
The last chunk may be smaller than the chunk size. Since read() returns b"" once the file has been read to the end, the while loop will terminate.
I think probably the best and most idiomatic way to do this would be to use the built-in iter() function along with its optional sentinel argument to create and use an iterable as shown below. Note that the last chunk might be less than the requested chunk size if the file size isn't an exact multiple of it.
from functools import partial

CHUNK_SIZE = 1024
filename = 'testfile.dat'

with open(filename, 'rb') as file:
    for chunk in iter(partial(file.read, CHUNK_SIZE), b''):
        process_data(chunk)
Update: I don't know when it was added, but almost exactly what's shown above now appears as an example in the official documentation of the iter() function.

File Reading Options Enquiry (Python)

I am a programming student for the semester. In class we have been learning about file opening, reading and writing.
We have used a_reader to achieve such tasks for file opening. I have been reading our associated texts and I have noticed that there is a CSV reader option, which I have been using.
I wanted to know if there are any more possible ways to open/read a file, as I am trying to grow my knowledge base in Python and its associated contents.
EDIT:
I was referring to CSV more specifically, as that is the type of file we use at the moment. We have learnt about CSV Reader and a_reader, and an example from one of our lectures is shown below.
def main():
    a_reader = open('IDCJAC0016_009225_1800_Data.csv', 'rU')
    file_data = a_reader.read()
    a_reader.close()
    print file_data

main()
It may seem overly broad, but I have little prior knowledge, which is why I am asking whether there is more than just the two ways above. If there is, can someone who knows provide the types so I can read up on and research them?
If you're asking about places to store things, the first interfaces you'll meet are files and sockets (pretend a network connection is like a file, see http://docs.python.org/2/library/socket.html).
If you mean file formats (like csv), there are many! Probably you can think of many yourself, but besides csv there are html files, pictures (png, jpg, gif), archive formats (tar, zip), text files (.txt!), python files (.py). The list goes on.
There are many ways to read files in different ways.
Just plain open will take a filename and open it as a sequence of lines. Or, you can just call read() on it, and it will read the whole file at once into one giant string.
codecs.open will take a filename and a character set, and decode each line to Unicode automatically. Or, again, you can just call read() on it, and it will read and decode the whole file at once into one giant Unicode string.
csv.reader will take a file or file-like object, and read it as a sequence of CSV rows. There's no direct equivalent of read()—but you can turn any sequence into a list by just calling list on it, so list(my_reader) will give you a list of rows (each of which is, itself, a list).
zipfile.ZipFile will take a filename, or a file or file-like object, and read it as a ZIP archive. This doesn't go line by line, of course, but you can go archived file by archived file. Or you can do fancier things, like search for archived files by name.
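To make the first few of these concrete, here is a small illustrative sketch (the file names are made up):

import codecs
import csv
import zipfile

with open('notes.txt') as f:                             # plain text, line by line
    for line in f:
        print(line.rstrip())

with codecs.open('notes.txt', encoding='utf-8') as f:    # decoded to Unicode
    text = f.read()

with open('data.csv') as f:                              # CSV rows as lists
    rows = list(csv.reader(f))

with zipfile.ZipFile('archive.zip') as z:                # members of a ZIP archive
    names = z.namelist()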
There are modules for reading JSON and XML documents, different ways of handling binary files, and so on. Some of them work differently—for example, you can search an XML document as a tree with one module, or go element by element with a different one.
Python has a pretty extensive standard library, and you can find the documentation online. Every module that seems like it should be able to work on files, probably can.
And, beyond what comes in the standard library, PyPI, the Python Package Index has thousands of additional modules. Looking for a way to read YAML documents? Search PyPI for yaml and you'll find it.
Finally, Python makes it very easy to add things like this on your own. The skeleton of a function like csv.reader is as simple as this:
def reader(fileobj):
    for line in fileobj:
        yield parse_one_csv_line(line)
You can replace that parse_one_csv_line with anything you want, and you've got a custom reader. For example, here's an uppercase_reader:
def uppercase_reader(fileobj):
    for line in fileobj:
        yield line.upper()
In fact, you can even write the whole thing in one line:
shouts = (line.upper() for line in fileobj)
And the best thing is that, as long as your reader only yields one line at a time, your reader is itself an iterator over lines (which is all csv.reader needs), so you can pass uppercase_reader(fileobj) to csv.reader and it works just fine.
