I have a file F, content huge numbers e.g F = [1,2,3,4,5,6,7,8,9,...]. So i want to loop over the file F and delete all lines contain any numbers in file say f = [1,2,4,7,...].
F = open(file)
f = [1,2,4,7,...]
for line in F:
if line.contains(any number in f):
delete the line in F
You can not immediately delete lines in a file, so have to create a new file where you write the remaining lines to. That is what chonws example does.
It's not clear to me what the form of the file you are trying to modify is. I'm going to assume it looks like this:
1,2,3
4,5,7,19
6,2,57
7,8,9
128
Something like this might work for you:
filter = set([2, 9])
lines = open("data.txt").readlines()
outlines = []
for line in lines:
line_numbers = set(int(num) for num in line.split(","))
if not filter & line_numbers:
outlines.append(line)
if len(outlines) < len(lines):
open("data.txt", "w").writelines(outlines)
I've never been clear on what the implications of doing an open() as a one-off are, but I use it pretty regularly, and it doesn't seem to cause any problems.
exclude = set((2, 4, 8)) # is faster to find items in a set
out = open('filtered.txt', 'w')
with open('numbers.txt') as i: # iterates over the lines of a file
for l in i:
if not any((int(x) in exclude for x in l.split(','))):
out.write(l)
out.close()
I'm assuming the file contains only integer numbers separated by ,
Something like this?:
nums = [1, 2]
f = open("file", "r")
source = f.read()
f.close()
out = open("file", "w")
for line in source.splitlines():
found = False
for n in nums:
if line.find(str(n)) > -1:
found = True
break
if found:
continue
out.write(line+"\n")
out.close()
Related
A have a fairly large text file that I need to reorder the lines such as....
line1
line2
line3
Reordered to look like this
line2
line1
line3
File will continue with more lines and the same reordering must occur. I'm stuck and need a loop to do this. Unfortunately, I have hit a speed bump.
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
for line in fin:
ordering = [1, 0, 2]
for idx in ordering: # Write output lines in the desired order.
fout.write(line)
If the number of lines is a multiple of 3:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
while True:
try:
in_lines = [next(fin) for _ in range(3)]
fout.write(in_lines[1])
fout.write(in_lines[0])
fout.write(in_lines[2])
except StopIteration:
break
Look at the function grouper in itertools documentation. You then need to do something like:
for lines in grouper(3, fin, '')
for idx in [1, 0, 2]:
fout.write(lines[idx])
You haven't specified what you want to do if the input file doesn't have an exact multiple of 3 lines. This code uses blanks.
You can read in 3 lines at a time.
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
line1 = True
while line1:
line1 = fin.readline()
line2 = fin.readline()
line3 = fin.readline()
fout.write(line2)
fout.write(line1)
fout.write(line3)
For a single line swap the code can be like:
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
for idx, line in enumerate(fin):
if idx == 0:
temp = line # Store for later
elif idx == 1:
fout.write(line) # Write line 1
fout.write(temp) # Write stored line 0
else:
fout.write(line) # Write as is
For repeated swaps the condition can be e. g. idx % 3 == 0 depending on the requirements.
Here's a way to do it that makes use of a couple of Python utilities and a helper function to make things relatively easy. If the number of lines in the file isn't an exact multiple of the length of the group you want to reorder, they're left alone—but that could easily be changed that if you desired.
The grouper() helper function is similar—but not identical—to the recipe with one with same name that's shown in in the itertools documentation.
from itertools import zip_longest
from operator import itemgetter
def grouper(n, iterable):
''' s -> (s0,s1,...sn-1), (sn,sn+1,...s2n-1), (s2n,s2n+1,...s3n-1), ... '''
FILLER = object() # Value which couldn't be in data.
for result in zip_longest(*[iter(iterable)]*n, fillvalue=FILLER):
yield tuple(v for v in result if v is not FILLER)
ordering = 1, 0, 2
reorder = itemgetter(*ordering)
group_len = len(ordering)
with open('poop.txt') as fin, open('clean330.txt', 'w') as fout:
for group in grouper(group_len, fin):
try:
group = reorder(group)
except IndexError:
pass # Don't reorder potential partial group at end.
fout.writelines(group)
I'm looking for a way to read from two large files simultaneously without bringing the whole data into memory. I want to parse M lines from the first file with N lines from the second file. Is there any wise and memory efficient solution for it?
So far I know how to do it with reading two files at the same time line by line. But I don't know if it would be possible to extend this code to read for example 4 lines from the first file, and 1 line from the second file.
from itertools import izip
with open("textfile1") as textfile1, open("textfile2") as textfile2:
for x, y in izip(textfile1, textfile2):
x = x.strip()
y = y.strip()
print("{0}\t{1}".format(x, y))
from here, Read two textfile line by line simultaneously -python
Just open the files and use e.g. line = textfile1.readline() to read a line from one of the files.
line will contain a trailing newline. You see that you reached the end, when an empty string is returned.
That would read the next n lines from file1 then next m lines from file2 within some other code
def nextlines(number, file):
n_items = []
index = number
while(index > 0):
try:
n_items += [next(file).strip()]
index -= 1
except StopIteration:
break
return n_items
n = 5
m = 7
l1 = []
l2 = []
with open('file.dat', 'r') as file1:
with open('file.dat', 'r') as file2:
#some code
l1 = nextlines(n, file1)
l2 = nextlines(m, file2)
#some other code
file2.close()
file1.close()
print l1
print l2
Originally posted here: How to read and delete first n lines from file in Python - Elegant Solution
I have a pretty big file ~ 1MB in size and I would like to be able to read first N lines, save them into a list (newlist) for later use, and then delete them.
My original code was:
import os
n = 3 #the number of line to be read and deleted
with open("bigFile.txt") as f:
mylist = f.read().splitlines()
newlist = mylist[:n]
os.remove("bigFile.txt")
thefile = open('bigFile.txt', 'w')
del mylist[:n]
for item in mylist:
thefile.write("%s\n" % item)
Based on Jean-François Fabre code that was posted and later deleted here I am able to run the following code:
import shutil
n = 3
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
for _ in range(n):
next(f)
f2.writelines(f)
This works great for deleting the first n lines and "updating" the bigFile.txt but when I try to store the first n values into a list so I can later use them like this:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
mylist = f.read().splitlines()
newlist = mylist[:n]
for _ in range(n):
next(f)
f2.writelines(f)
I get an "StopIteration" error
In your sample code you are reading the entire file to find the first n lines:
# this consumes the entire file
mylist = f.read().splitlines()
This leaves nothing left to read for the subsequent code. Instead simply do:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
# read the first n lines into newlist
newlist = [f.readline() for _ in range(n)]
f2.writelines(f)
I would proceed as follows:
n = 3
yourlist = []
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
i=0
for line in f:
i += 1
if i<n:
yourlist.append(line)
else:
f2.write(f)
Does anyone have a clue how to increase the speed of this part of python code?
It was designed to deal with small files (with just a few lines, and for this is very fast) but i want to run it with big files (with ~50Gb, and millions of lines).
The main goal of this code is to get stings from a file (.txt) and search for these in a input file printing the the number of times that these occurred in the output file.
Here is the code: infile, seqList and out are determined by the optparse as Options in the beginning of the code (not shown)
def novo (infile, seqList, out) :
uDic = dict()
rDic = dict()
nmDic = dict()
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
samples = [line.strip() for line in RADlist]
lines = [line.strip() for line in infile]
#Create dictionaires with all the samples
for i in samples:
uDic[i.replace(" ","")] = 0
rDic[i.replace(" ","")] = 0
nmDic[i.replace(" ","")] = 0
for k in lines:
l1 = k.split("\t")
l2 = l1[0].split(";")
l3 = l2[0].replace(">","")
if len(l1)<2:
continue
if l1[4] == "U":
for k in uDic.keys():
if k == l3:
uDic[k] += 1
if l1[4] == "R":
for j in rDic.keys():
if j == l3:
rDic[j] += 1
if l1[4] == "NM":
for h in nmDic.keys():
if h == l3:
nmDic[h] += 1
f = open(out, "w")
f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
for i in samples:
U = int()
R = int()
NM = int ()
for k, j in uDic.items():
if k == i:
U = j
for o, p in rDic.items():
if o == i:
R = p
for y,u in nmDic.items():
if y == i:
NM = u
TOTAL = int(U + R + NM)
try:
f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL$
except:
continue
f.close()
With processing 50 GB files, the question is not how to make it faster, but how to make it runnable
at all.
The main problem is, you will run out of memory and shall modify the code to be processing the files
without having all the files in memory, but rather having in memory onle a line, which is needed.
Following code from your question is reading all the lines form two files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
samples = [line.strip() for line in RADlist]
lines = [line.strip() for line in infile]
# at this moment you are likely to run out of memory already
#Create dictionaires with all the samples
for i in samples:
uDic[i.replace(" ","")] = 0
rDic[i.replace(" ","")] = 0
nmDic[i.replace(" ","")] = 0
#similar loop over `lines` comes later on
You shall defer reading the lines till the latest possible moment like this:
#Create dictionaires with all the samples
with open(seqList, 'r') as RADlist:
for samplelines in RADlist:
sample = sampleline.strip()
for i in samples:
uDic[i.replace(" ","")] = 0
rDic[i.replace(" ","")] = 0
nmDic[i.replace(" ","")] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will let you to take off and run.
It would make it much easier if you provided some sample inputs. Because you haven't I haven't tested this, but the idea is simple - iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise inner looping:
def novo (infile, seqList, out):
from collections import Counter
import csv
# Count
counts = Counter()
with open(infile, 'r') as infile:
for line in infile:
l1 = line.strip().split("\t")
l2 = l1[0].split(";")
l3 = l2[0].replace(">","")
if len(l1)<2:
continue
counts[(l1[4], l3)] += 1
# Produce output
types = ['R', 'U', 'NM']
with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
f = csv.writer(outfile, delimiter='\t')
f.writerow(types + ['TOTAL'] + ['%' + t for t in types])
for sample in RADlist:
sample = sample.strip()
countrow = [counts((t, sample)) for t in types]
total = sum(countrow)
f.writerow([sample] + countrow + [total] + [c/total for c in countrow])
samples = [line.strip() for line in RADlist]
lines = [line.strip() for line in infile]
If you convert your script into functions (it makes profiling easier), and then see what it does when you code profile it : I suggest using runsnake : runsnakerun
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
uDict[i.replace(" ","")] = 0
Try:
udict = {i.replace(" ",""):0 for i in samples}
and similarly for the other dicts
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when you have certain values for l1[4]. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
myDict[x] = ....
for example:
for k in uDic.keys():
if k == l3:
uDic[k] += 1
can be replaced with:
if l3 in uDic:
uDic[l3] += 1
Other than that, try profiling.
1)Look into profilers and adjust the code that is taking the most time.
2)You could try optimizing some methods with Cython - use data from profiler to modify correct thing
3)It looks like you can use a counter instead of a dict for the output file, and a set for the input file -look into them.
set = set()
from Collections import Counter
counter = Counter() # Essentially a modified dict, that is optimized for counting...
# like counting occurences of strings in a text file
4) If you are reading 50GB of memory you won't be able to store it all in RAM (I'm assuming who knows what kind of computer you have), so generators should save your memory and time.
#change list comprehension to generators
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
I am trying to read a data file into a 2 dimensional array.
For example:
file.dat:
1 2 3 a
4 5 6 b
7 8 9 c
I tried something like:
file=open("file.dat","r")
var = [[]]
var.append([j for j in i.split()] for i in file)
but that didn't work.
I need the data in form of two dimensional array as I need to do operations with each element afterwards, e.g.
for k in range(3):
newval(k) = var[k,1]
Any idea how to do that?
file = open("file.dat", "r") # open file for reading
var = [] # initialize empty array
for line in file:
var.append(line.strip().split(' ')) # split each line on the <space>, and turn it into an array
# thus creating an array of arrays.
file.close() # close file.
This worked for me:
with open("/path/to/file", 'r') as f:
lines = [[float(n) for n in line.strip().split(' ')] for line in f]
It's really weird people said in the comments there is no one line solution for this. It took me so little testing to make this work.
v = []
with open("data.txt", 'r') as file:
for line in file:
if line.split():
line = [float(x) for x in line.split()]
v.append(line)
print(v)