xrange not adhering to start, stop, step - python

My input file looks like:
6
*,b,*
a,*,*
*,*,c
foo,bar,baz
w,x,*,*
*,x,y,z
5
/w/x/y/z/
a/b/c
foo/
foo/bar/
foo/bar/baz/
When I use xrange, why does it not adhere to the start, stop, step method?
with open(sys.argv[1], 'r') as f:
    for _ in xrange(0, 7, 1):
        next(f)
    for listPatterns in f:
        print listPatterns.rstrip()
It skips the first seven lines and outputs the rest, when in actuality I want it to print lines 1 through 7.

The code you want is:
with open(sys.argv[1], 'r') as f:
    for _ in xrange(0, 7, 1):
        print f.next().rstrip()
The first loop you have is advancing through the file.

For each item in the iterable (in this case xrange) you're calling next on the file object and ignoring the result - either do something with that result, or better yet, make the intent much clearer:
from itertools import islice

with open('file') as fin:
    for line in islice(fin, 7):
        print line.rstrip()  # do something with each of the first 7 lines

Er, well, because you told it to skip the first 7 lines? Solution: don't do that.

It isn't xrange. You first loop through all of xrange, calling next on the file each time, which advances through it; then you exit that loop; then your second loop acts on the lines that remain in the file.
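To make the mechanics concrete, here is a minimal sketch (the file name is a placeholder): a file object is a single iterator, so whatever the first loop consumes is gone by the time the second loop starts.
with open('input.txt') as f:       # hypothetical file name
    for _ in xrange(7):
        skipped = next(f)          # lines 1-7 are read and discarded here
    for line in f:
        print line.rstrip()        # the second loop resumes at line 8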

You can also iterate through the file without using a proxy iterator:
START_LINE = 0
STOP_LINE = 6
with open(sys.argv[1], 'r') as f:
    for i, line in enumerate(f.readlines()):
        if START_LINE <= i <= STOP_LINE:
            print line.rstrip()
        elif i > STOP_LINE:
            break


Reorder Lines in a Text File (Loop Assistance)

I have a fairly large text file in which I need to reorder the lines, such as:
line1
line2
line3
Reordered to look like this
line2
line1
line3
The file continues with more lines, and the same reordering must occur throughout. I'm stuck and need a loop to do this; unfortunately, I have hit a speed bump.
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for line in fin:
        ordering = [1, 0, 2]
        for idx in ordering:  # Write output lines in the desired order.
            fout.write(line)
If the number of lines is a multiple of 3:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    while True:
        try:
            in_lines = [next(fin) for _ in range(3)]
            fout.write(in_lines[1])
            fout.write(in_lines[0])
            fout.write(in_lines[2])
        except StopIteration:
            break
Look at the grouper function in the itertools documentation. You then need to do something like:
for lines in grouper(3, fin, ''):
    for idx in [1, 0, 2]:
        fout.write(lines[idx])
You haven't specified what you want to do if the input file doesn't have an exact multiple of 3 lines. This code pads the last group with blanks.
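For reference, the grouper recipe being referred to looks roughly like this (shown in the older argument order that matches the grouper(3, fin, '') call above; use izip_longest on Python 2 and zip_longest on Python 3):
from itertools import izip_longest  # zip_longest in Python 3

def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)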
You can read in 3 lines at a time.
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    line1 = True
    while line1:
        line1 = fin.readline()
        line2 = fin.readline()
        line3 = fin.readline()
        fout.write(line2)
        fout.write(line1)
        fout.write(line3)
For a single line swap the code can be like:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for idx, line in enumerate(fin):
        if idx == 0:
            temp = line           # Store for later
        elif idx == 1:
            fout.write(line)      # Write line 1
            fout.write(temp)      # Write stored line 0
        else:
            fout.write(line)      # Write as is
For repeated swaps the condition can be e.g. idx % 3 == 0, depending on the requirements.
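As a rough sketch of that generalization (my own illustration, not code from the answer above), the same enumerate loop can swap the first two lines of every group of three:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for idx, line in enumerate(fin):
        if idx % 3 == 0:
            temp = line           # hold the first line of each group
        elif idx % 3 == 1:
            fout.write(line)      # write the second line first
            fout.write(temp)      # then the held first line
        else:
            fout.write(line)      # third line is written as is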
Here's a way to do it that makes use of a couple of Python utilities and a helper function to make things relatively easy. If the number of lines in the file isn't an exact multiple of the length of the group you want to reorder, the leftover lines are left alone, but that could easily be changed if you desired.
The grouper() helper function is similar, but not identical, to the recipe with the same name shown in the itertools documentation.
from itertools import zip_longest
from operator import itemgetter

def grouper(n, iterable):
    ''' s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... '''
    FILLER = object()  # Value which couldn't be in data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield tuple(v for v in result if v is not FILLER)

ordering = 1, 0, 2
reorder = itemgetter(*ordering)
group_len = len(ordering)

with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for group in grouper(group_len, fin):
        try:
            group = reorder(group)
        except IndexError:
            pass  # Don't reorder potential partial group at end.
        fout.writelines(group)

How to read and delete first n lines from file in Python - Elegant Solution

I have a pretty big file ~ 1MB in size and I would like to be able to read first N lines, save them into a list (newlist) for later use, and then delete them.
My original code was:
import os

n = 3  # the number of lines to be read and deleted

with open("bigFile.txt") as f:
    mylist = f.read().splitlines()

newlist = mylist[:n]
os.remove("bigFile.txt")

thefile = open('bigFile.txt', 'w')
del mylist[:n]
for item in mylist:
    thefile.write("%s\n" % item)
Based on Jean-François Fabre's code, which was posted and later deleted here, I am able to run the following code:
import shutil

n = 3
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    for _ in range(n):
        next(f)
    f2.writelines(f)
This works great for deleting the first n lines and "updating" the bigFile.txt but when I try to store the first n values into a list so I can later use them like this:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
mylist = f.read().splitlines()
newlist = mylist[:n]
for _ in range(n):
next(f)
f2.writelines(f)
I get an "StopIteration" error
In your sample code you are reading the entire file to find the first n lines:
# this consumes the entire file
mylist = f.read().splitlines()
This leaves nothing left to read for the subsequent code. Instead simply do:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
# read the first n lines into newlist
newlist = [f.readline() for _ in range(n)]
f2.writelines(f)
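If you also want the original file name to end up pointing at the trimmed contents, a minimal sketch of the whole flow could look like this (the os.replace step is my addition, not part of the answer above):
import os

n = 3  # number of lines to keep in memory and drop from the file

with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    newlist = [f.readline() for _ in range(n)]  # first n lines, saved for later use
    f2.writelines(f)                            # copy the remaining lines

os.replace("bigFile2.txt", "bigFile.txt")       # swap the trimmed file into place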
I would proceed as follows:
n = 3
yourlist = []
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    i = 0
    for line in f:
        i += 1
        if i <= n:
            yourlist.append(line)  # keep the first n lines
        else:
            f2.write(line)         # write the remaining lines to the new file

The fastest way to find a string inside a file

I have tried many ways of looking for a string inside a file, but all were slow. All I need is:
look for a string inside a file
print the line on which the string is
All I've been doing until now was reading a file (tried many ways) and then checking whether the string I am looking for is located in the current line. If not, check the next line, etc.
What is the best way to do this?
The following will work for the first occurrence of a substring something. None is assigned if no match is found; the file is read lazily, only up to the first match.
with open('input.txt') as f:
    line = next((l for l in f if something in l), None)
To find all matches, you can use a list comprehension:
with open('input.txt') as f:
    lines = [l for l in f if something in l]
I do not know if you can get much faster than this in pure Python.
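Since the question also asks to print the line on which the string is, a small variation (my own sketch in Python 3, not from the answer above) reports the 1-based line number as well:
something = 'needle'  # hypothetical search string
with open('input.txt') as f:
    for lineno, l in enumerate(f, 1):
        if something in l:
            print(lineno, l.rstrip())
            break
    else:
        print('not found')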
I tried very hard to get a faster version than Yakym's using itertools instead of plain Python iteration. In the end it was still slower. Perhaps someone can come up with a better way.
from itertools import imap, tee, compress, repeat
from time import time

target = 'Language Chooser'
path = '/Users/alexhall/Desktop/test.log'

start = time()
with open(path) as f:
    lines1 = [l for l in f if target in l]
print time() - start

# -----

start = time()
with open(path) as f:
    f1, f2 = tee(f)
    lines2 = list(compress(f1, imap(str.__contains__, f2, repeat(target))))
print time() - start

assert lines1 == lines2
I don't know your target for speed, but this is pretty fast since it defers most of the work to built-in functions.
import string, re, sys

def find_re(fname, regexp):
    line_regexp = '^(.*%s.*)$' % regexp
    f = open(fname, 'r')
    txt = string.join(f.readlines())
    matches = re.findall(line_regexp, txt, re.MULTILINE)
    for m in matches:
        print(m)

def find_slow(fname, regexp):
    f = open(fname, 'r')
    r = re.compile(regexp)
    for li in f.readlines():
        if r.search(li):
            print(li),

The 'slow' version is probably what you tried. The other one, find_re, is about twice as fast (0.7 seconds to search a 27 MB text file), but still 20x slower than grep.
Python 3 version (code from Alex Hall):
from itertools import tee, compress, repeat
from time import time

target = '1665588283.688523'
path = 'old.index.log'

start = time()
with open(path) as f:
    lines1 = [l for l in f if target in l]
print(time() - start)

# -----

start = time()
with open(path) as f:
    f1, f2 = tee(f)
    lines2 = list(compress(f1, map(str.__contains__, f2, repeat(target))))
print(time() - start)

assert lines1 == lines2

How to delete lines from a file in Python

I have a file F containing huge numbers, e.g. F = [1,2,3,4,5,6,7,8,9,...]. I want to loop over the file F and delete every line that contains any of the numbers in f, say f = [1,2,4,7,...].
F = open(file)
f = [1,2,4,7,...]
for line in F:
    if line.contains(any number in f):
        delete the line in F
You cannot delete lines from a file in place, so you have to create a new file and write the remaining lines to it. That is what chonws's example does.
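As a minimal sketch of that filter-into-a-new-file pattern (file names, the excluded numbers, and the final os.replace step are my placeholders, assuming comma-separated integers per line):
import os

exclude = {1, 2, 4, 7}                      # hypothetical numbers to drop

with open('F.txt') as src, open('F.tmp', 'w') as dst:
    for line in src:
        numbers = {int(x) for x in line.split(',')}
        if not numbers & exclude:           # keep lines with no excluded number
            dst.write(line)

os.replace('F.tmp', 'F.txt')                # swap the filtered file into place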
It's not clear to me what the form of the file you are trying to modify is. I'm going to assume it looks like this:
1,2,3
4,5,7,19
6,2,57
7,8,9
128
Something like this might work for you:
filter = set([2, 9])
lines = open("data.txt").readlines()
outlines = []
for line in lines:
    line_numbers = set(int(num) for num in line.split(","))
    if not filter & line_numbers:
        outlines.append(line)
if len(outlines) < len(lines):
    open("data.txt", "w").writelines(outlines)
I've never been clear on what the implications of doing an open() as a one-off are, but I use it pretty regularly, and it doesn't seem to cause any problems.
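For what it's worth, a version of the same snippet using with blocks (my sketch, not the original author's code) closes both file handles deterministically instead of relying on the one-off open() calls:
filter_nums = {2, 9}  # same filter as above, renamed to avoid shadowing the builtin

with open("data.txt") as f:
    lines = f.readlines()

outlines = [line for line in lines
            if not filter_nums & {int(num) for num in line.split(",")}]

if len(outlines) < len(lines):
    with open("data.txt", "w") as f:
        f.writelines(outlines)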
exclude = set((2, 4, 8))  # is faster to find items in a set
out = open('filtered.txt', 'w')
with open('numbers.txt') as i:  # iterates over the lines of a file
    for l in i:
        if not any((int(x) in exclude for x in l.split(','))):
            out.write(l)
out.close()
I'm assuming the file contains only integer numbers separated by commas.
Something like this?:
nums = [1, 2]
f = open("file", "r")
source = f.read()
f.close()
out = open("file", "w")
for line in source.splitlines():
    found = False
    for n in nums:
        if line.find(str(n)) > -1:
            found = True
            break
    if found:
        continue
    out.write(line + "\n")
out.close()

Python how to read N number of lines at a time

I am writing a code to take an enormous textfile (several GB) N lines at a time, process that batch, and move onto the next N lines until I have completed the entire file. (I don't care if the last batch isn't the perfect size).
I have been reading about using itertools islice for this operation. I think I am halfway there:
from itertools import islice
N = 16
infile = open("my_very_large_text_file", "r")
lines_gen = islice(infile, N)
for lines in lines_gen:
    ...process my lines...
The trouble is that I would like to process the next batch of 16 lines, but I am missing something.
islice() can be used to get the next n items of an iterator. Thus, list(islice(f, n)) will return a list of the next n lines of the file f. Using this inside a loop will give you the file in chunks of n lines. At the end of the file, the list might be shorter, and finally the call will return an empty list.
from itertools import islice
with open(...) as f:
    while True:
        next_n_lines = list(islice(f, n))
        if not next_n_lines:
            break
        # process next_n_lines
An alternative is to use the grouper pattern:
from itertools import zip_longest
with open(...) as f:
    for next_n_lines in zip_longest(*[f] * n):
        # process next_n_lines
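One caveat with the zip_longest grouper (my note, not part of the answer above): the final group is padded with None fill values, which you will probably want to strip before processing. A small sketch:
from itertools import zip_longest

n = 16
with open('my_very_large_text_file') as f:
    for batch in zip_longest(*[f] * n):
        lines = [line for line in batch if line is not None]  # drop padding in the final batch
        print(len(lines))  # stand-in for real per-batch processing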
The question appears to presume that there is efficiency to be gained by reading an "enormous textfile" in blocks of N lines at a time. This adds an application layer of buffering over the already highly optimized stdio library, adds complexity, and probably buys you absolutely nothing.
Thus:
with open('my_very_large_text_file') as f:
    for line in f:
        process(line)
is probably superior to any alternative in time, space, complexity and readability.
See also Rob Pike's first two rules, Jackson's Two Rules, and PEP-20 The Zen of Python. If you really just wanted to play with islice you should have left out the large file stuff.
Here is another way using groupby:
from itertools import count, groupby

N = 16
with open('test') as f:
    for g, group in groupby(f, key=lambda _, c=count(): c.next()/N):
        print list(group)
How it works:
Basically, groupby() groups consecutive lines by the return value of the key function, which here is lambda _, c=count(): c.next()/N. The default argument c is bound to a fresh count() when the lambda is defined, so every time groupby() calls the key function the counter advances, and the integer-divided result determines which group the current line belongs to:
# 1 iteration.
c.next() => 0
0 / 16 => 0
# 2 iteration.
c.next() => 1
1 / 16 => 0
...
# Start of the second grouper.
c.next() => 16
16/16 => 1
...
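The code above is Python 2 (it uses c.next() and Python 2's integer division with /). A Python 3 equivalent, as a sketch, would use next(c) and floor division:
from itertools import count, groupby

N = 16
with open('test') as f:  # same hypothetical file name as above
    for g, group in groupby(f, key=lambda _, c=count(): next(c) // N):
        print(list(group))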
Since the requirement was added that there be a statistically uniform distribution of the lines selected from the file, I offer this simple approach.
"""randsamp - extract a random subset of n lines from a large file"""
import random
def scan_linepos(path):
"""return a list of seek offsets of the beginning of each line"""
linepos = []
offset = 0
with open(path) as inf:
# WARNING: CPython 2.7 file.tell() is not accurate on file.next()
for line in inf:
linepos.append(offset)
offset += len(line)
return linepos
def sample_lines(path, linepos, nsamp):
"""return nsamp lines from path where line offsets are in linepos"""
offsets = random.sample(linepos, nsamp)
offsets.sort() # this may make file reads more efficient
lines = []
with open(path) as inf:
for offset in offsets:
inf.seek(offset)
lines.append(inf.readline())
return lines
dataset = 'big_data.txt'
nsamp = 5
linepos = scan_linepos(dataset) # the scan only need be done once
lines = sample_lines(dataset, linepos, nsamp)
print 'selecting %d lines from a file of %d' % (nsamp, len(linepos))
print ''.join(lines)
I tested it on a mock data file of 3 million lines comprising 1.7GB on disk. The scan_linepos dominated the runtime taking about 20 seconds on my not-so-hot desktop.
Just to check the performance of sample_lines I used the timeit module like so:
import timeit

t = timeit.Timer('sample_lines(dataset, linepos, nsamp)',
                 'from __main__ import sample_lines, dataset, linepos, nsamp')
trials = 10 ** 4
elapsed = t.timeit(number=trials)
print u'%dk trials in %.2f seconds, %.2fµs per trial' % (trials/1000,
    elapsed, (elapsed/trials) * (10 ** 6))
For various values of nsamp: when nsamp was 100, a single sample_lines call completed in 460µs, and it scaled linearly up to 10k samples at 47ms per call.
The natural next question is "Random is barely random at all?", and the answer is "sub-cryptographic but certainly fine for bioinformatics".
I used the chunker function from What is the most “pythonic” way to iterate over a list in chunks?:
from itertools import izip_longest

def grouper(iterable, n, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return izip_longest(*args, fillvalue=fillvalue)

with open(filename) as f:
    for lines in grouper(f, chunk_size, ""):  # for every chunk_sized chunk
        """process lines like
           lines[0], lines[1], ..., lines[chunk_size-1]"""
Assuming "batch" means to want to process all 16 recs at one time instead of individually, read the file one record at a time and update a counter; when the counter hits 16, process that group. interim_list = []
infile = open("my_very_large_text_file", "r")
ctr = 0
for rec in infile:
interim_list.append(rec)
ctr += 1
if ctr > 15:
process_list(interim_list)
interim_list = []
ctr = 0
the final group
process_list(interim_list)
Another solution might be to create an iterator that yields lists of n elements:
def n_elements(n, it):
    try:
        while True:
            yield [next(it) for j in range(0, n)]
    except StopIteration:
        return

with open(filename, 'rt') as f:
    for n_lines in n_elements(n, f):
        do_stuff(n_lines)
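Note that the generator above silently drops a final batch that is shorter than n. If you want that partial batch as well, a sketch of my own (not from the answer above) yields whatever was collected before the iterator ran out:
def n_elements_keep_tail(n, it):
    """Yield lists of up to n items; the last list may be shorter."""
    batch = []
    for item in it:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:          # partial final batch
        yield batch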
