Reorder Lines in a Text File (Loop Assistance) - python

I have a fairly large text file whose lines I need to reorder, such that...
line1
line2
line3
is reordered to look like this:
line2
line1
line3
The file continues with more lines, and the same reordering must occur throughout. I'm stuck and need a loop to do this; unfortunately, I have hit a speed bump.
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for line in fin:
        ordering = [1, 0, 2]
        for idx in ordering:  # Write output lines in the desired order.
            fout.write(line)

If the number of lines is a multiple of 3:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    while True:
        try:
            in_lines = [next(fin) for _ in range(3)]
            fout.write(in_lines[1])
            fout.write(in_lines[0])
            fout.write(in_lines[2])
        except StopIteration:
            break

Look at the grouper function in the recipes section of the itertools documentation. You then need to do something like:
for lines in grouper(3, fin, ''):
    for idx in [1, 0, 2]:
        fout.write(lines[idx])
You haven't specified what you want to do if the input file doesn't have an exact multiple of 3 lines. This code uses blanks.
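Putting that together, a minimal self-contained sketch (untested; the grouper helper is adapted from the itertools recipe, and a trailing partial group is padded with empty strings):
from itertools import zip_longest  # izip_longest on Python 2

def grouper(n, iterable, fillvalue=None):
    # Collect data into fixed-length chunks: grouper(3, 'ABCDEFG', 'x') -> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for lines in grouper(3, fin, ''):
        for idx in [1, 0, 2]:
            fout.write(lines[idx])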

You can read in 3 lines at a time.
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    line1 = True
    while line1:
        line1 = fin.readline()
        line2 = fin.readline()
        line3 = fin.readline()
        fout.write(line2)
        fout.write(line1)
        fout.write(line3)

For a single swap, the code can look like this:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for idx, line in enumerate(fin):
        if idx == 0:
            temp = line           # Store for later
        elif idx == 1:
            fout.write(line)      # Write line 1
            fout.write(temp)      # Write stored line 0
        else:
            fout.write(line)      # Write as is
For repeated swaps the condition can be, e.g., idx % 3 == 0, depending on the requirements (see the sketch below).
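For instance, a rough sketch of the repeated-swap variant (same file names as above; untested, and it assumes every group of three lines should have its first two swapped):
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    held = None  # first line of the current group, waiting for its successor
    for idx, line in enumerate(fin):
        if idx % 3 == 0:
            held = line           # remember the first line of the group
        elif idx % 3 == 1:
            fout.write(line)      # write the second line first
            fout.write(held)      # then the stored first line
            held = None
        else:
            fout.write(line)      # third line passes through unchanged
    if held is not None:
        fout.write(held)          # the file ended right after a group's first line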

Here's a way to do it that makes use of a couple of Python utilities and a helper function to make things relatively easy. If the number of lines in the file isn't an exact multiple of the length of the group you want to reorder, the leftover lines are left alone, but that could easily be changed if you desired.
The grouper() helper function is similar, but not identical, to the recipe of the same name shown in the itertools documentation.
from itertools import zip_longest
from operator import itemgetter

def grouper(n, iterable):
    ''' s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... '''
    FILLER = object()  # Sentinel value which couldn't appear in the data.
    for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
        yield tuple(v for v in result if v is not FILLER)

ordering = 1, 0, 2
reorder = itemgetter(*ordering)
group_len = len(ordering)

with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
    for group in grouper(group_len, fin):
        try:
            group = reorder(group)
        except IndexError:
            pass  # Don't reorder a potential partial group at the end.
        fout.writelines(group)

Related

replace with value from array Python

I created a txt file with my name in 3 lines of strings:
adam1
adam2
adam3
and an array:
array = ['Tom','Monica','Jean']
I want to replace "adam1" with "Tom" from the array, "adam2" with "Monica", etc.
import string
s = open("test.txt",'r')
array = ['Tom','Monica','Jean']
I started the code, but I don't know how to create a for loop that does this with the replace() method. Can anybody help?
with open('test.txt') as fin:
    lst = list(map(lambda s: s.strip(), fin))
with open('test.txt', 'w') as fout:
    lst[:len(array)] = array
    for elem in lst:
        fout.write(str(elem) + '\n')
with open('test_input.txt') as fin, open('test_output.txt','w') as fout:
    for num, line in enumerate(fin, 0):
        fout.write(replace_array[num] + '\n')  # replace_array: the list of replacement names (array above)
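Since the question specifically asks for a loop using replace(), here is a rough sketch of that approach as well (old_names is a hypothetical list of the names to swap out; it does a plain text-wide replace, so it assumes the old names don't occur anywhere else in the file):
array = ['Tom', 'Monica', 'Jean']
old_names = ['adam1', 'adam2', 'adam3']   # hypothetical: the names to be replaced, in order

with open('test.txt') as fin:
    text = fin.read()
for old, new in zip(old_names, array):
    text = text.replace(old, new)         # swap each old name for its replacement
with open('test.txt', 'w') as fout:
    fout.write(text)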

How to read and delete first n lines from file in Python - Elegant Solution

I have a pretty big file, ~1 MB in size, and I would like to be able to read the first N lines, save them into a list (newlist) for later use, and then delete them.
My original code was:
import os

n = 3  # the number of lines to be read and deleted
with open("bigFile.txt") as f:
    mylist = f.read().splitlines()
newlist = mylist[:n]
os.remove("bigFile.txt")
thefile = open('bigFile.txt', 'w')
del mylist[:n]
for item in mylist:
    thefile.write("%s\n" % item)
Based on Jean-François Fabre's code that was posted and later deleted here, I am able to run the following code:
import shutil

n = 3
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    for _ in range(n):
        next(f)
    f2.writelines(f)
This works great for deleting the first n lines and "updating" the bigFile.txt but when I try to store the first n values into a list so I can later use them like this:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
mylist = f.read().splitlines()
newlist = mylist[:n]
for _ in range(n):
next(f)
f2.writelines(f)
I get an "StopIteration" error
In your sample code you are reading the entire file to find the first n lines:
# this consumes the entire file
mylist = f.read().splitlines()
This leaves nothing left to read for the subsequent code. Instead simply do:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
# read the first n lines into newlist
newlist = [f.readline() for _ in range(n)]
f2.writelines(f)
I would proceed as follows:
n = 3
yourlist = []
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    i = 0
    for line in f:
        i += 1
        if i <= n:
            yourlist.append(line)   # keep the first n lines
        else:
            f2.write(line)          # copy the rest to the new file
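A more compact variant of the same idea, using itertools.islice to grab the first n lines (a sketch with the same file names as above, not tested against the original data):
from itertools import islice

n = 3
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    newlist = list(islice(f, n))   # the first n lines, kept for later use
    f2.writelines(f)               # whatever remains goes to the new file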

Increase Python code speed

Does anyone have a clue how to increase the speed of this part of my Python code?
It was designed to deal with small files (with just a few lines, and for those it is very fast), but I want to run it on big files (~50 GB, with millions of lines).
The main goal of this code is to take strings from one file (.txt), search for them in an input file, and print the number of times they occur to the output file.
Here is the code. infile, seqList and out are set by optparse as options at the beginning of the code (not shown):
def novo(infile, seqList, out):
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
    # Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ","")] = 0
        rDic[i.replace(" ","")] = 0
        nmDic[i.replace(" ","")] = 0
    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">","")
        if len(l1) < 2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1
    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
When processing 50 GB files, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory. You should modify the code so that it processes the files without holding everything in memory, keeping only the line that is currently needed.
The following code from your question reads all the lines from two files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already

# Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ","")] = 0
    rDic[i.replace(" ","")] = 0
    nmDic[i.replace(" ","")] = 0
# a similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:
# Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will let you take off and run.
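For illustration, here is a sketch (untested, assuming the tab-separated format from the question and the dictionaries built above) of how the main counting loop could likewise be streamed one line at a time; it also checks for at least five fields before touching l1[4]:
with open(infile, 'r') as data:
    for rawline in data:
        l1 = rawline.strip().split("\t")
        if len(l1) < 5:
            continue                    # need at least five tab-separated fields
        l3 = l1[0].split(";")[0].replace(">", "")
        if l1[4] == "U" and l3 in uDic:
            uDic[l3] += 1
        elif l1[4] == "R" and l3 in rDic:
            rDic[l3] += 1
        elif l1[4] == "NM" and l3 in nmDic:
            nmDic[l3] += 1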
It would make things much easier if you provided some sample inputs. Because you haven't, I haven't tested this, but the idea is simple: iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise inner looping:
def novo(infile, seqList, out):
    from collections import Counter
    import csv
    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">", "")
            if len(l1) < 2:
                continue
            counts[(l1[4], l3)] += 1
    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            f.writerow([sample] + countrow + [total] + [float(c) / total for c in countrow])
If you convert your script into functions (which makes profiling easier), you can then see what it does when you profile it. I suggest using runsnake: runsnakerun.
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
    uDict[i.replace(" ","")] = 0
Try:
udict = {i.replace(" ",""):0 for i in samples}
and similarly for the other dicts
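For example, a sketch covering all three dictionaries at once (same sample names as in the question):
cleaned = [i.replace(" ", "") for i in samples]
uDic = {name: 0 for name in cleaned}
rDic = {name: 0 for name in cleaned}
nmDic = {name: 0 for name in cleaned}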
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when you have certain values for l1[4]. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ....
for example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some methods with Cython; use the data from the profiler to decide what to modify.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file; look into them (a short Counter illustration follows after this list).
my_set = set()        # renamed so the built-in set() isn't shadowed
from collections import Counter
counter = Counter()   # essentially a modified dict that is optimized for counting,
                      # e.g. counting occurrences of strings in a text file
4) If you are reading 50 GB of data you won't be able to store it all in RAM (I'm assuming, since who knows what kind of computer you have), so generators should save your memory and time.
# change the list comprehensions to generator expressions
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
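As a quick illustration of point 3, here is what Counter buys you over a plain dict (a tiny standalone example, not tied to the files above):
from collections import Counter

counts = Counter()
for word in ["a", "b", "a", "c", "a"]:
    counts[word] += 1            # missing keys default to 0, so no setup loop is needed

print(counts["a"])               # 3
print(counts.most_common(1))     # [('a', 3)]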

xrange not adhering to start, stop, step

My input file looks like:
6
*,b,*
a,*,*
*,*,c
foo,bar,baz
w,x,*,*
*,x,y,z
5
/w/x/y/z/
a/b/c
foo/
foo/bar/
foo/bar/baz/
When I use xrange, why does it not adhere to the start, stop, step arguments?
with open(sys.argv[1], 'r') as f:
    for _ in xrange(0, 7, 1):
        next(f)
    for listPatterns in f:
        print listPatterns.rstrip()
It outputs the text starting after line 7, when in actuality I want it to print lines 1 through 7.
The code you want is:
with open(sys.argv[1], 'r') as f:
    for _ in xrange(0, 7, 1):
        print f.next().rstrip()
The first loop you have is advancing through the file.
For each item in the iterable (in this case xrange) you're calling next on the file object and ignoring the result. Either do something with that result or, better yet, make it much clearer:
from itertools import islice

with open('file') as fin:
    for line in islice(fin, 7):
        print(line.rstrip())  # do something with each of the first 7 lines
Er, well, because you told it to skip the first 7 lines? Solution: don't do that.
It isn't xrange.
You first loop through all of xrange, ignoring the lines you read.
Then you exit that loop.
Then you have another loop, acting on whatever is left of the file.
You can also iterate through the file without using a proxy iterator:
START_LINE = 0
STOP_LINE = 6
with open(sys.argv[1], 'r') as f:
    for i, line in enumerate(f.readlines()):
        if START_LINE <= i <= STOP_LINE:
            print line.rstrip()
        elif i > STOP_LINE:
            break
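A small variant of the same loop iterates over the file object directly, so the whole file never has to be pulled into memory with readlines() (a sketch, same constants as above):
import sys

START_LINE = 0
STOP_LINE = 6
with open(sys.argv[1], 'r') as f:
    for i, line in enumerate(f):     # lines are read lazily, one at a time
        if i > STOP_LINE:
            break
        if i >= START_LINE:
            print(line.rstrip())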

How to delete lines from a file in Python

I have a file F containing huge numbers, e.g. F = [1,2,3,4,5,6,7,8,9,...]. I want to loop over the file F and delete all lines that contain any of the numbers in another file, say f = [1,2,4,7,...].
F = open(file)
f = [1,2,4,7,...]
for line in F:
    if line.contains(any number in f):
        delete the line in F
You cannot delete lines from a file in place, so you have to create a new file and write the remaining lines to it. That is what chonws' example does.
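A minimal sketch of that pattern (hypothetical file names; it assumes each line is a comma-separated list of integers, and os.replace needs Python 3.3+):
import os

exclude = {1, 2, 4, 7}                    # numbers whose lines should be removed

with open('numbers.txt') as src, open('numbers.tmp', 'w') as dst:
    for line in src:
        values = {int(x) for x in line.split(',') if x.strip()}
        if not values & exclude:          # keep lines containing none of the excluded numbers
            dst.write(line)

os.replace('numbers.tmp', 'numbers.txt')  # swap the filtered file into place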
It's not clear to me what the form of the file you are trying to modify is. I'm going to assume it looks like this:
1,2,3
4,5,7,19
6,2,57
7,8,9
128
Something like this might work for you:
filter = set([2, 9])
lines = open("data.txt").readlines()
outlines = []
for line in lines:
    line_numbers = set(int(num) for num in line.split(","))
    if not filter & line_numbers:
        outlines.append(line)
if len(outlines) < len(lines):
    open("data.txt", "w").writelines(outlines)
I've never been clear on what the implications of doing an open() as a one-off are, but I use it pretty regularly, and it doesn't seem to cause any problems.
exclude = set((2, 4, 8))  # it is faster to find items in a set
out = open('filtered.txt', 'w')
with open('numbers.txt') as i:  # iterates over the lines of the file
    for l in i:
        if not any(int(x) in exclude for x in l.split(',')):
            out.write(l)
out.close()
I'm assuming the file contains only integer numbers separated by commas.
Something like this?:
nums = [1, 2]
f = open("file", "r")
source = f.read()
f.close()
out = open("file", "w")
for line in source.splitlines():
    found = False
    for n in nums:
        if line.find(str(n)) > -1:
            found = True
            break
    if found:
        continue
    out.write(line + "\n")
out.close()
