How to preprocess a text stream on the fly in Python?

What I need is a Python 3 function (or whatever) that takes a text stream (like sys.stdin, or the one returned by open(file_name, "rt")) and returns a text stream to be consumed by some other function, but with all spaces removed, all tabs replaced with commas, and all letters converted to lowercase on the fly (the "lazy" way) as the data is read by the consumer code.
I assume there is a reasonably easy way to do this in Python 3, something similar to list comprehensions, but I don't know exactly what it might be.

I am not sure this is what you mean, but the easiest way I can think of is to inherit from io.TextIOWrapper (the type that open returns in text mode; Python 3 has no file builtin) and override the read method to transform the data after reading it. A simple implementation would be:
import io

class MyFile(io.TextIOWrapper):
    def read(self, *args, **kwargs):
        data = super().read(*args, **kwargs)
        # process the data before handing it to the caller
        return data.replace(' ', '').replace('\t', ',').lower()
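Note that only calls to read() are intercepted; readline() and line iteration bypass the override. A hypothetical usage, since TextIOWrapper wraps a binary buffer:

f = MyFile(open(file_name, 'rb'))  # wrap the binary buffer, then read text
print(f.read())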

I believe what you are looking for is the io module, more specifically io.StringIO.
You can use the open() function to read the initial data, modify it, then pass the resulting stream around:
import io

with open(file_name, 'rt') as f:
    stream = io.StringIO(f.read().replace(' ', '').replace('\t', ',').lower())
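Note that this transforms the whole file eagerly rather than lazily. For a genuinely on-the-fly version, here is a minimal sketch using a generator (preprocess is a name invented here; file_name is as in the question, and this suits any consumer that iterates line by line):

def preprocess(stream):
    # transform each line only when the consumer asks for it
    for line in stream:
        yield line.replace(' ', '').replace('\t', ',').lower()

with open(file_name, 'rt') as f:
    for line in preprocess(f):
        pass  # the consumer sees preprocessed text, one line at a time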

Related

Search for a word, and modify the whole line in Python text processing

This is my carDatabase.txt:
CarID:c01 ModelName:honda VehicleType:city Price:20
CarID:c02 ModelName:honda VehicleType:x Price:30
I want to search for the CarID and modify only that whole line, without disturbing the others.
My current code is here:
# Converting txt data into a string and modify
carsDatabaseFile = open('carsDatabase.txt', 'r')
allDataFromDatabase = [line.split(',') for line in carsDatabaseFile.readlines()]
Note:
Your question has a couple of issues: your sample from carDatabase.txt looks like it is tab-delimited, but your current code looks like it is splitting the line around the ',' character. This also looks like a place where a list comprehension might be hurting you more than it is helping you. Break that up into a for-loop if you're trying to add some logic to manipulate a single line.
For looking at CSV files, I would highly recommend pandas for general manipulation of data in comma-separated form as well as a number of other formats.
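For example, a hypothetical pandas call, assuming the file really were tab-delimited (the column layout here is only illustrative):

import pandas as pd

cars = pd.read_csv('carsDatabase.txt', sep='\t')
print(cars.head())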
That said, if you are truly restricted to built-in packages, or you are looking at this as a learning exercise, and your goal is to directly manipulate just one line of that file, what you are looking for is the seek method. You can use it in combination with the tell method (documented just below seek) to find where you are in the file.
1. Write a for loop to identify which line in the file you are looking for.
2. From there, use the output of tell() to find the specific place in the file you are trying to manipulate.
3. Using the output of the two steps above, set the file pointer to that location with the seek() method (by byte offset: files are stored as a one-dimensional sequence of bytes).
4. Use the write() method to update the file directly at the location you determined above. A sketch of all four steps follows below.
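A hedged sketch of those four steps, with an update_car_line helper invented here; it assumes the carsDatabase.txt format from the question, and the replacement line must be exactly the same length as the one it overwrites, since seek() and write() replace bytes in place rather than inserting:

def update_car_line(filename, car_id, new_line):
    # new_line must be exactly as long as the line it replaces;
    # open in 'rb+' with bytes instead to avoid newline translation issues
    with open(filename, 'r+') as f:
        while True:
            pos = f.tell()             # step 2: where the next line starts
            line = f.readline()        # step 1: walk the file line by line
            if not line:
                break                  # no matching CarID found
            if line.startswith('CarID:' + car_id):
                f.seek(pos)            # step 3: jump back to the line start
                f.write(new_line)      # step 4: overwrite in place
                break

update_car_line('carsDatabase.txt', 'c01',
                'CarID:c01 ModelName:mazda VehicleType:city Price:25')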

What are the appropriate argument/return types for a function to take binary files/streams/filenames and convert them to readable text format?

I have a function that's intended to take a binary file format and convert it to a readable text format, e.g.:
def textualize(binary_stuff):
    # magic to turn binary stuff into text
    return text_stuff
There are a few different types I could accept as input or produce as output, and I'm unsure what to use. Here are some options and corresponding objections I can think of:
Take a bytes object as input and return a string.
Problematic if, say, the input is originating from a huge file that now has to be read into memory.
Take a file-like object as input, read it, and return a string.
Relies on the caller to open the file in the right mode.
The asymmetry of this disturbs me for reasons I can't quite put a finger on.
Take two file-like objects; read from one and write to the other instead of returning anything.
Again relies on the caller to open the files in the right mode.
Makes the most common cases (named file to named file, or bytes to string) more unwieldy than they need to be.
Take two filenames and handle opening stuff myself.
What if the caller wants to convert data that isn't in a named file?
Accept multiple possible input types.
Possibly complicated to program.
Still leaves the question of what to return.
Is there an established Right Thing to do for conversions like this? Are there additional tradeoffs I'm missing?
You could do this the way the json module does: one function for strings and another for files, leaving the opening and closing of files to the caller, which gives the caller more flexibility. You could then use functools.singledispatch to dispatch between the variants, e.g.:
from functools import singledispatch
from io import BytesIO, StringIO, IOBase, TextIOBase

@singledispatch
def textualise(input, output):
    if not isinstance(input, IOBase):
        raise TypeError(input)
    if not isinstance(output, TextIOBase):
        raise TypeError(output)
    data = input.read().decode("utf-8")
    output.write(data)
    output.flush()

@textualise.register(bytes)
def textualise_bytes(bytes_):
    input = BytesIO(bytes_)
    output = StringIO()
    textualise(input, output)
    return output.getvalue()

@textualise.register(str)
def textualise_filenames(in_filename, out_filename):
    with open(in_filename, "rb") as input, open(out_filename, "wt") as output:
        textualise(input, output)

s = textualise(b"some text")
assert s == "some text"
textualise("inputfile.txt", "outputfile.txt")
I would personally avoid the third form, since bytes objects are also valid filenames. For example, textualise(b"inputfile.txt", "outputfile.txt") would get dispatched to the wrong function (textualise_bytes).

Tornado: Read uploaded CSV file?

I want to do something like:
import csv
import tornado.web
class MainHandler(tornado.web.RequestHandler):
    def post(self):
        uploaded_csv_file = self.request.files['file'][0]
        with uploaded_csv_file as csv_file:
            for row in csv.reader(csv_file):
                self.write(' , '.join(row))
But, uploaded_csv_file is not of type file.
What is the best practice here?
Sources:
http://docs.python.org/2/library/csv.html
http://docs.python.org/2/library/functions.html#open
https://stackoverflow.com/a/11911972/242933
As the documentation explains:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable.
So, if you have something which is not a file, but is an iterator over lines, that's fine. If it's not even an iterator over lines, just wrap it in one. For a trivial example, if it's something with a read_all() method, you can always do this:
uploaded_csv_file = self.request.files['file'][0]
contents = uploaded_csv_file.read_all()
lines = contents.splitlines()
for row in csv.reader(lines):
    # ...
(Obviously you can merge steps together to make it briefer; I just wrote each step as a separate line to make it simpler to understand.)
Of course if the CSV files are large, and especially if they take a while to arrive and you've got a nice streaming interface, you probably don't want to read the whole thing at once this way. Most network server frameworks offer nice protocol adapters to, e.g., take a stream of bytes and give you a stream of lines. (For that matter, even socket.makefile() in the stdlib sort of does that…) But in your case, I don't think that's an issue.
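For Tornado specifically, here is a hedged sketch: in current Tornado versions the entries of request.files are tornado.httputil.HTTPFile objects whose body attribute holds the raw upload as bytes, so the upload can be decoded and split into lines before being handed to csv.reader (UTF-8 is assumed here):

import csv
import tornado.web

class MainHandler(tornado.web.RequestHandler):
    def post(self):
        uploaded = self.request.files['file'][0]
        lines = uploaded.body.decode('utf-8').splitlines()  # assumes UTF-8
        for row in csv.reader(lines):
            self.write(' , '.join(row))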

In python, is there a way for re.finditer to take a file as input instead of a string?

Let's say I have a really large file foo.txt and I want to iterate through it doing something upon finding a regular expression. Currently I do this:
f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()
Is there a way to do this without having to store the entire file in memory?
NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.
UPDATE: I would also like this to work with stdin if possible.
UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.
If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.
from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            nextline = f.readline()
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    pass  # test whether your regex matches and do something
Either you will have to read the file chunk-wise, with overlaps to allow for the maximum possible length of the expression, or use an mmapped file, which will work almost as well as using a stream: https://docs.python.org/library/mmap.html
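A minimal sketch of the mmap route (the file name and pattern are placeholders): the OS pages the file in lazily, and a compiled bytes pattern can search the mmap object directly because it supports the buffer protocol:

import mmap
import re

pattern = re.compile(rb'foo.*?bar', re.DOTALL)  # hypothetical multi-line pattern
with open('foo.txt', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in pattern.finditer(mm):
            print(m.group())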
UPDATE to your UPDATE:
Consider that stdin isn't a file; it just behaves a lot like one in that it has a file descriptor and so on. It is a POSIX stream. If you are unclear on the difference, do some googling around. The OS cannot mmap it, therefore Python cannot.
Also consider that what you're doing may be an ill-suited thing to use a regex for. Regexes are great for capturing small stuff, like parsing a connection string, a log entry, CSV data and so on. They are not a good tool for parsing through huge chunks of data; this is by design. You may be better off writing a custom parser.
Some words of wisdom from the past:
http://regex.info/blog/2006-09-15/247
Perhaps you could write a function that reads and yields one line of the file at a time, and call re.finditer on each line until the read returns an empty string at EOF.
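A minimal sketch of that idea (finditer_lines is a name invented here); note that it can only find matches contained within a single line, which conflicts with the multi-line requirement in the question:

import re

def finditer_lines(fileobj, pattern):
    for line in fileobj:  # readline-driven iteration stops at EOF
        yield from re.finditer(pattern, line)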
Here is another solution, using an internal text buffer to progressively yield found matches without loading the entire file in memory.
This buffer acts like a "sliding window" over the file text, moving forward while yielding found matches.
Because the file content is loaded in chunks, this solution works with multiline regexes too.
def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""
    while 1:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))
        # End of file: search through the remaining final buffer and exit
        if not text:
            yield from matches
            break
        # Yield found matches except the last one, which may be
        # incomplete because of the chunk cut (think about '.*')
        if len(matches) > 1:
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]
However, note that it may end up loading the whole file in memory if no matches are found at all, so you should only use this function if you are confident that your file contains the regex pattern many times.
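For reference, a hypothetical call, assuming a compiled multiline pattern and a file named foo.txt:

import re

with open('foo.txt') as f:
    for match in find_chunked(f, re.compile(r'\d+\n\d+')):
        print(match.group())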

Read file object as string in python

I'm using urllib2 to read in a page. I need to do a quick regex on the source and pull out a few variables, but urllib2 presents it as a file object rather than a string.
I'm new to Python, so I'm struggling to see how to use a file object to do this. Is there a quick way to convert this into a string?
You can use Python in interactive mode to search for solutions.
If f is your object, you can enter dir(f) to see all methods and attributes. There's one called read. Enter help(f.read) and it tells you that f.read() is the way to retrieve a string from a file object.
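For example, a hypothetical interactive session (example.txt is a placeholder):

>>> f = open('example.txt')
>>> dir(f)          # lists the methods and attributes, including 'read'
>>> help(f.read)    # explains that f.read() returns the contents as a string
>>> s = f.read()    # s is now an ordinary string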
From the doc file.read() (my emphasis):
file.read([size])
Read at most size bytes from the file (less if the read hits EOF before obtaining size bytes). If the size argument is negative or omitted, read all data until EOF is reached. The bytes are returned as a string object. An empty string is returned when EOF is encountered immediately. (For certain files, like ttys, it makes sense to continue reading after an EOF is hit.) Note that this method may call the underlying C function fread more than once in an effort to acquire as close to size bytes as possible. Also note that when in non-blocking mode, less data than was requested may be returned, even if no size parameter was given.
Be aware that a regexp search on a large string object may not be efficient, and consider doing the search line-by-line, using file.next() (a file object is its own iterator).
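A small hedged sketch of that line-by-line approach (Python 2 to match urllib2; the pattern is invented for illustration):

import re

pattern = re.compile('title="([^"]*)"')  # hypothetical pattern
for line in f:  # a file object is its own iterator
    match = pattern.search(line)
    if match:
        print match.group(1)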
Michael Foord, aka Voidspace, has an excellent tutorial on urllib2, which you can find here:
urllib2 - The Missing Manual
What you are doing should be pretty straightforward, observe this sample code:
import urllib2
import re
response = urllib2.urlopen("http://www.voidspace.org.uk/python/articles/urllib2.shtml")
html = response.read()
pattern = '(V.+space)'
wordPattern = re.compile(pattern, re.IGNORECASE)
results = wordPattern.search(html)
print results.groups()
