Reading n lines from a file (but not all) in Python

How can I read n lines from a file instead of just one when iterating over it? I have a file with a well-defined structure and I would like to do something like this:
for line1, line2, line3 in file:
    do_something(line1)
    do_something_different(line2)
    do_something_else(line3)
but it doesn't work:
ValueError: too many values to unpack
For now I am doing this:
for line in file:
    do_something(line)
    newline = file.readline()
    do_something_else(newline)
    newline = file.readline()
    do_something_different(newline)
    # ... etc.
which sucks because I am writing endless 'newline = file.readline()' statements that clutter the code.
Is there any smart way to do this? (I really want to avoid reading the whole file at once because it is huge.)

Basically, your file is an iterator which yields your file one line at a time. This turns your problem into how to yield several items at a time from an iterator. A solution to that is given in this question. Note that the function islice is in the itertools module, so you will have to import it from there.
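For example, here is a minimal sketch of that idea using itertools.islice (the file name and the processing functions are placeholders):
from itertools import islice

def read_n_lines(f, n):
    # Pull at most n lines from the file iterator; returns fewer than n
    # lines (possibly an empty list) once the end of the file is reached.
    return list(islice(f, n))

with open("data.txt") as f:          # hypothetical file name
    while True:
        chunk = read_n_lines(f, 3)
        if len(chunk) < 3:           # end of file (or a trailing partial group)
            break
        line1, line2, line3 = chunk
        do_something(line1)          # placeholder processing functions
        do_something_different(line2)
        do_something_else(line3)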

If it's XML, why not just use lxml?
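For instance, a minimal streaming sketch with lxml, assuming the data really is XML; the file name and the 'record' tag are placeholders:
from lxml import etree

# iterparse streams the document instead of loading it all into memory;
# "record" stands in for whatever element marks one logical entry.
for event, elem in etree.iterparse("data.xml", events=("end",), tag="record"):
    # ... process elem here ...
    elem.clear()   # free elements that have already been processed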

You could use a helper function like this:
def readnlines(f, n):
    lines = []
    for x in range(0, n):
        lines.append(f.readline())
    return lines
Then you can do something like you want:
while True:
    line1, line2, line3 = readnlines(file, 3)
    if not line1:            # readline() returns '' at end of file
        break
    do_stuff(line1)
    do_stuff(line2)
    do_stuff(line3)
That being said, if you are using xml files, you will probably be happier in the long run if you use a real xml parser...

itertools to the rescue:
import itertools

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    # itertools.izip_longest is Python 2; in Python 3 it is itertools.zip_longest
    return itertools.izip_longest(fillvalue=fillvalue, *args)

fobj = open(yourfile, "r")
for line1, line2, line3 in grouper(3, fobj):
    pass

for i in file produces a str for each line, so you can't just do for i, j, k in file and read it in batches of three. (Try a, b, c = 'bar' and a, b, c = 'too many characters' and look at the values of a, b and c to work out why you get "too many values to unpack".)
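A quick demonstration of that unpacking behaviour:
a, b, c = 'bar'                    # a == 'b', b == 'a', c == 'r'
a, b, c = 'too many characters'    # ValueError: too many values to unpack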
It's not clear entirely what you mean, but if you're doing the same thing for each line and just want to stop at some point, then do it like this:
for line in file_handle:
    do_something(line)
    if some_condition:
        break  # Don't want to read anything else
(Also, don't use file as a variable name, you're shadowing a builtin.)

If you're doing the same thing, why do you need to process multiple lines per iteration?
for line in file is your friend. It is in general much more efficient than manually reading the file, both in terms of I/O performance and memory.
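As a minimal sketch of that pattern (the file name and do_something are placeholders), plain line iteration reads lazily, one buffered line at a time:
with open("huge_file.txt") as fh:    # hypothetical file name
    for line in fh:                  # lazy: only one line is held at a time
        do_something(line)           # placeholder processing function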

Do you know something about the length of the lines / the format of the data? If so, you could read in the first n bytes (say 80*3) with something like f.read(240).split("\n")[0:3].

If you want to be able to use this data over and over again, one approach might be to do this:
lines = []
for line in file_handle:
    lines.append(line)
This will give you a list of the lines, which you can then access by index. Also, when you say a HUGE file, the size most likely isn't actually a problem, because Python can process thousands of lines very quickly.

why can't you just do:
ctr = 0
for line in file:
    if ctr == 0:
        ....
    elif ctr == 1:
        ....
    ctr = (ctr + 1) % 3    # wrap around after each group of three lines
if you find the if/elif construct ugly you could just create a dict or list of functions and then do:
for ctr, line in enumerate(file):
    function_list[ctr % len(function_list)](line)
or something similar

It sounds like you are trying to read from disk in parallel... that is really hard to do. All the solutions given to you are realistic and legitimate. You shouldn't let something put you off just because the code "looks ugly". The most important thing is how efficient/effective it is; then, if the code is messy, you can tidy it up, but don't look for a whole new method of doing something just because you don't like how one way of doing it looks in code.
As for running out of memory, you may want to check out pickle.

It's possible to do it with a clever use of the zip function. It's short, but a bit voodoo-ish for my tastes (hard to see how it works). It cuts off any lines at the end that don't fill a group, which may be good or bad depending on what you're doing. If you need the final lines, itertools.izip_longest might do the trick.
zip(*[iter(inputfile)] * 3)
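A short usage sketch of that one-liner (the file name is a placeholder); because the same iterator appears three times, zip pulls three consecutive lines per loop iteration:
with open("data.txt") as inputfile:              # hypothetical file name
    for line1, line2, line3 in zip(*[iter(inputfile)] * 3):
        ...                                      # process the three lines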
Doing it more explicitly and flexibly, this is a modification of Mats Ekberg's solution:
def groupsoflines(f, n):
    while True:
        group = []
        for i in range(n):
            try:
                group.append(next(f))
            except StopIteration:
                if group:
                    tofill = n - len(group)
                    yield group + [None] * tofill
                return
        yield group

for line1, line2, line3 in groupsoflines(inputfile, 3):
    ...
N.B. If this runs out of lines halfway through a group, it will fill in the gaps with None, so that you can still unpack it. So, if the number of lines in your file might not be a multiple of three, you'll need to check whether line2 and line3 are None.

Related

Reading a file using range function

I am reading a big file in chunks like
def gen_data(data):
    for i in range(0, len(data), chunk_sz):
        yield data[i: i + chunk_sz]
If I use a length variable instead of len(data), something like this:
length_of_file = len(data)
def gen_data(data):
    for i in range(0, length_of_file, chunk_sz):
        yield data[i: i + chunk_sz]
What would the performance improvement be for big files? I tested with small ones but didn't see any change.
P.S. I am from a C/C++ background, where recalculating a bound on every repetition of a while or for loop is bad practice because it executes on every pass.
Use this code to read big file into chunks:
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('really_big_file.dat')
for piece in read_in_chunks(f):
    process_data(piece)
Another option using iter
f = open('really_big_file.dat')
def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
Python's for loops are not C for loops, but really foreach-style loops. In your example:
for i in range(0, len(data), chunk_sz):
range() is only called once, then Python iterates on the return value (a list in Python 2, an iterable range object in Python 3). IOW, from this POV, your snippets are equivalent - the difference being that the second snippet uses a non-local variable length_of_file, so you actually get a performance hit from resolving it.
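A small sketch that makes this visible (noisy_len is a made-up helper that reports each call):
def noisy_len(seq):
    print("len() evaluated")     # printed exactly once per for statement
    return len(seq)

data = b"0123456789" * 100
chunk_sz = 64

# The range(...) expression in the for header is evaluated once, before
# the loop starts, so the message above appears a single time.
for i in range(0, noisy_len(data), chunk_sz):
    chunk = data[i:i + chunk_sz]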
I am from C/C++ background where calculating in each repetition in while or for loop is a bad practice because it executes for every call
Eventual compiler optimizations set aside, this is true for most if not all languages.
This being said, and as others already mentioned in comments or answers: this is not how you read a file in chunks - you want SurajM's first snippet.

Extracting and processing data from a txt file

I am a beginner in Python (and in programming). I have a large file containing repeating groups of 3 lines with numbers, then 1 empty line, and so on...
If I print the file it looks like:
1.93202838
1.81608154
1.50676177
2.35787777
1.51866227
1.19643624
...
I want to take each group of three numbers as one vector, do some math operations on them, write them back to a new file, and move on to the next three lines - the next vector. So here is my code (it doesn't work):
import math
inF = open("data.txt", "r+")
outF = open("blabla.txt", "w")
a = []
fin = []
b = []
for line in inF:
    a.append(line)
    if line.startswith(" \n"):
        fin.append(b)
        h1 = float(fin[0])
        k2 = float(fin[1])
        l3 = float(fin[2])
        h = h1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        k = k1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        l = l1/(math.sqrt(h1*h1+k1*k1+l1*l1)+1)
        vector = [str(h), str(k), str(l)]
        outF.write('\n'.join(vector)
        b = a
        a = []
inF.close()
outF.close()
print "done!"
I want to get a "vector" from each 3 lines in my file and put it into the blabla.txt output file. Thanks a lot!
My 'code comment' answer:
take care to close all parentheses, in order to match the opened ones! (a missing one is very likely to raise a SyntaxError ;-) )
fin is created as an empty list and is never properly filled; trying to get a value with fin[n] is therefore very likely to break with an IndexError;
k2 and l3 are created but never used;
k1 and l1 are used but never created; this is very likely to break with a NameError;
b is created as a copy of a, so it is a list. But you do a fin.append(b): what do you expect in this case from appending (not extending) a list?
Hope this helps! A corrected sketch along these lines is below.
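A minimal corrected sketch of what the script seems to be aiming for, assuming groups of three numbers separated by a blank line and using the formula from the question (file names are the ones from the question):
import math

with open("data.txt") as inF, open("blabla.txt", "w") as outF:
    group = []
    for line in inF:
        line = line.strip()
        if line:                         # collect the numbers of one vector
            group.append(float(line))
        if len(group) == 3:              # a complete vector: scale and write it
            h1, k1, l1 = group
            norm = math.sqrt(h1 * h1 + k1 * k1 + l1 * l1) + 1
            outF.write('\n'.join(str(x / norm) for x in group) + '\n')
            group = []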
This is only in the answers section for length and formatting.
Input and output.
Control flow
I know nothing of vectors; you might want to look into the math module or NumPy.
Those links should hopefully give you all the information you need to at least get started with this problem. As yuvi said, the code won't be written for you, but you can come back when you have something that isn't working as you expected or that you don't fully understand.

python similar string removal from multiple files

I have crawled txt files from different websites; now I need to glue them into one file. Many of the lines from the various websites are similar to each other. I want to remove the repetitions.
Here is what I have tried:
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'
sourcefile = open('%s.txt' % sourcename)
sourcelines = sourcefile.readlines()
sourcefile.close()
for sourceline in sourcelines:
    destfile = open('%s.txt' % destname, 'a+')
    destlines = destfile.readlines()
    similar = False
    for destline in destlines:
        ratio = difflib.SequenceMatcher(None, destline, sourceline).ratio()
        if ratio > 0.8:
            print destline
            print sourceline
            similar = True
    if not similar:
        destfile.write(sourceline)
    destfile.close()
I will run it for every source, and write line by line to the same file. The result is that, even if I run it for the same file multiple times, the lines are always appended to the destination file.
EDIT:
I have tried the code from the answer. It's still very slow.
Even if I minimize the IO, I still need to do O(n^2) comparisons, which matters when you have 1000+ lines. I have on average 10,000 lines per file.
Are there any other ways to remove the duplicates?
Here is a short version that does minimal IO and cleans up after itself.
import difflib
sourcename = 'xiaoshanwujzw'
destname = 'bindresult'

with open('%s.txt' % destname, 'w+') as destfile:
    # we read in the file so that on subsequent runs of this script, we
    # won't duplicate the lines.
    known_lines = set(destfile.readlines())
    with open('%s.txt' % sourcename) as sourcefile:
        for line in sourcefile:
            similar = False
            for known in known_lines:
                ratio = difflib.SequenceMatcher(None, line, known).ratio()
                if ratio > 0.8:
                    print ratio
                    print line
                    print known
                    similar = True
                    break
            if not similar:
                destfile.write(line)
                known_lines.add(line)
Instead of reading the known lines from the file each time, we save them to a set, which we compare against. The set is essentially a mirror of the contents of 'destfile'.
A note on complexity
By its very nature, this problem has O(n²) complexity. Because you're looking for similarity with known strings, rather than identical strings, you have to look at every previously seen string. If you were looking to remove exact duplicates, rather than fuzzy matches, you could use a simple lookup in a set, with complexity O(1), making your entire solution have O(n) complexity.
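A minimal sketch of that exact-duplicate case, using a set for O(1) membership tests (the file names follow the ones in the question):
seen = set()
with open('bindresult.txt', 'w') as destfile:
    with open('xiaoshanwujzw.txt') as sourcefile:
        for line in sourcefile:
            if line not in seen:         # O(1) lookup instead of a full scan
                destfile.write(line)
                seen.add(line)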
There might be a way to reduce the fundamental complexity by using lossy compression on the strings so that two similar strings compress to the same result. This is however both out of scope for a stack overflow answer, and beyond my expertise. It is an active research area so you might have some luck digging through the literature.
You could also reduce the time taken by ratio() by using the less accurate alternatives quick_ratio() and real_quick_ratio().
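For example, a hedged sketch of using them as cheap pre-filters; both are documented upper bounds on ratio(), so they can only rule candidates out, never in:
from difflib import SequenceMatcher

def is_similar(a, b, threshold=0.8):
    sm = SequenceMatcher(None, a, b)
    # real_quick_ratio() and quick_ratio() never return less than ratio(),
    # so if either falls below the threshold the pair cannot be similar.
    if sm.real_quick_ratio() < threshold:
        return False
    if sm.quick_ratio() < threshold:
        return False
    return sm.ratio() >= threshold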
Your code works fine for me. It prints destline and sourceline to stdout when lines are similar (in the example I used, exactly the same), but it only wrote unique lines to the file once. You might need to set your ratio threshold lower for your specific "similarity" needs.
Basically what you need to do is check every line in the source file to see if it has a potential match against every line of the destination file.
##xiaoshanwujzw.txt
##-----------------
##radically different thing
##this is data
##and more data
##bindresult.txt
##--------------
##a website line
##this is data
##and more data
from difflib import SequenceMatcher

sourcefile = open('xiaoshanwujzw.txt', 'r')
sourcelines = sourcefile.readlines()
sourcefile.close()

destfile = open('bindresult.txt', 'a+')
destlines = destfile.readlines()

has_matches = {k: False for k in sourcelines}

for d_line in destlines:
    for s_line in sourcelines:
        if SequenceMatcher(None, d_line, s_line).ratio() > 0.8:
            has_matches[s_line] = True
            break

for k in has_matches:
    if has_matches[k] == False:
        destfile.write(k)
destfile.close()
This will add the line 'radically different thing' to the destination file.

How to Compare 2 very large matrices using Python

I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except that a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300MB files easily fit into memory.
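A hedged sketch of the sort-then-compare idea using difflib.unified_diff; it assumes both files have already been sorted (for example with the Unix sort command), and the file names are placeholders:
import difflib

with open('old_sorted.csv') as f1:
    old_lines = f1.readlines()
with open('new_sorted.csv') as f2:
    new_lines = f2.readlines()

for line in difflib.unified_diff(old_lines, new_lines, lineterm=''):
    # '-' lines were removed, '+' lines were added; a modified row shows
    # up as one removal plus one addition.
    if line.startswith(('-', '+')) and not line.startswith(('---', '+++')):
        print(line.rstrip('\n'))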
This is a little bit of a naive implementation but will deal with unsorted data:
import csv

file1_dict = {}
file2_dict = {}

# store the values as tuples so they can be concatenated with the key
# tuple when writing the output row
with open('file1.csv') as handle:
    for row in csv.reader(handle):
        file1_dict[tuple(row[:2])] = tuple(row[2:])

with open('file2.csv') as handle:
    for row in csv.reader(handle):
        file2_dict[tuple(row[:2])] = tuple(row[2:])

with open('outfile.csv', 'w') as handle:
    writer = csv.writer(handle)
    for key, val in file1_dict.iteritems():
        if key in file2_dict:
            # deal with keys that are in both
            if file2_dict[key] == val:
                writer.writerow(key + val + ('Same',))
            else:
                writer.writerow(key + file2_dict[key] + ('Modified',))
            file2_dict.pop(key)
        else:
            writer.writerow(key + val + ('Removed',))
    # deal with added keys!
    for key, val in file2_dict.iteritems():
        writer.writerow(key + val + ('Added',))
You probably won't be able to "drop in" this solution, but it should get you ~95% of the way there. @S.Lott is right, two 300MB files will easily fit in memory... if your files get into the 1-2GB range then this may have to be modified to assume sorted data.
Something like this is close... although you may have to change the comparisons around for Added and Modified to make sense:
# assuming both files are sorted by columns 1 and 2
import csv
import datetime
from itertools import imap

def str2date(datestr):
    return datetime.date(*map(int, datestr.split('-')))

def convert_tups(row):
    key = (row[0], str2date(row[1]))
    val = tuple(row[2:])
    return key, val

with open('file1.csv') as handle1:
    with open('file2.csv') as handle2:
        with open('outfile.csv', 'w') as outhandle:
            writer = csv.writer(outhandle)
            gen1 = imap(convert_tups, csv.reader(handle1))
            gen2 = imap(convert_tups, csv.reader(handle2))
            gen2key, gen2val = gen2.next()
            for gen1key, gen1val in gen1:
                if gen1key == gen2key and gen1val == gen2val:
                    writer.writerow(gen1key + gen1val + ('Same',))
                    gen2key, gen2val = gen2.next()
                elif gen1key == gen2key and gen1val != gen2val:
                    writer.writerow(gen2key + gen2val + ('Modified',))
                    gen2key, gen2val = gen2.next()
                elif gen1key > gen2key:
                    while gen1key > gen2key:
                        writer.writerow(gen2key + gen2val + ('Added',))
                        gen2key, gen2val = gen2.next()
                else:
                    writer.writerow(gen1key + gen1val + ('Removed',))
            # note: running off the end of file2 (StopIteration from
            # gen2.next()) is not handled here

Python: item for item until stopterm in item?

Disclaimer: I'm fairly new to python!
If I want all the lines of a file until (edit: and including) the line containing some string stopterm, is there a way of using the list syntax for it? I was hoping there would be something like:
usefullines = [line for line in file until stopterm in line]
For now, I've got
usefullines = []
for line in file:
    usefullines.append(line)
    if stopterm in line:
        break
It's not the end of the world, but since the rest of Python's syntax is so straightforward, I was hoping for a 1 thought->1 Python line mapping.
import re
from itertools import takewhile
usefullines = takewhile(lambda x: not re.search(stopterm, x), lines)
or, without the regular expression:
from itertools import takewhile
usefullines = takewhile(lambda x: stopterm not in x, lines)
Here's a way that keeps the stopterm line:
def useful_lines(lines, stopterm):
    for line in lines:
        if stopterm in line:
            yield line
            break
        yield line

usefullines = useful_lines(lines, stopterm)
# or...
for line in useful_lines(lines, stopterm):
    # ... do stuff
    pass
" I was hoping for a 1 thought->1 Python line mapping." Wouldn't we all love a programming language that somehow mirrored our natural language?
You can achieve that, you just need to define your unique thoughts once. Then you have the 1:1 mapping you were hoping for.
def usefulLines( aFile ):
    for line in aFile:
        yield line
        if line == stopterm:
            break
Is pretty much it.
for line in usefulLines( aFile ):
    # process a line, knowing it occurs BEFORE stopterm.
There are more general approaches. The lassevk answers with enum_while and enum_until are generalizations of this simple design pattern.
That itertools solution is neat. I have been amazed by itertools.groupby before - one handy tool.
But still, I was tinkering to see if I could do this without itertools. So here it is.
(There is one assumption and one drawback: the file must not be huge, and the code makes one extra complete iteration over the lines, respectively.)
I created a sample file named "try":
hello
world
happy
day
bye
Once you read the file and have the lines in a variable named lines:
lines=open('./try').readlines()
then
print [each for each in lines if lines.index(each)<=[lines.index(line) for line in lines if 'happy' in line][0]]
gives the result:
['hello\n', 'world\n', 'happy\n']
and
print [each for each in lines if lines.index(each)<=[lines.index(line) for line in lines if 'day' in line][0]]
gives the result:
['hello\n', 'world\n', 'happy\n', 'day\n']
So you get the last line - the stopterm line - included as well.
Forget this
Leaving the answer, but marking it community. See Steven Huwig's answer for the correct way to do this.
Well, [x for x in enumerable] will run until enumerable doesn't produce data any more; the if-part will simply allow you to filter along the way.
What you can do is add a function, and filter your enumerable through it:
def enum_until(source, until_criteria):
    for k in source:
        if until_criteria(k):
            break
        yield k

def enum_while(source, while_criteria):
    for k in source:
        if not while_criteria(k):
            break
        yield k

l1 = [k for k in enum_until(xrange(1, 100000), lambda y: y == 100)]
l2 = [k for k in enum_while(xrange(1, 100000), lambda y: y < 100)]
print l1
print l2
Of course, it doesn't look as nice as what you wanted...
I think it's fine to keep it that way. Sophisticated one-liners are not really Pythonic, and since Guido had to put a limit somewhere, I guess this is it...
I'd go with Steven Huwig's or S.Lott's solutions for real usage, but as a slightly hacky solution, here's one way to obtain this behaviour:
def stop(): raise StopIteration()
usefullines = list(stop() if stopterm in line else line for line in file)
It's slightly abusing the fact that anything that raises StopIteration will abort the current iteration (here, the generator expression), and it is uglier to read than your desired syntax, but it will work.
