Reading simultaneously different number of lines from two files - python

I'm looking for a way to read from two large files simultaneously without loading all the data into memory. I want to process M lines from the first file together with N lines from the second file. Is there a sensible, memory-efficient way to do this?
So far I know how to read two files at the same time, line by line. But I don't know whether this code can be extended to read, for example, 4 lines from the first file and 1 line from the second file.
from itertools import izip
with open("textfile1") as textfile1, open("textfile2") as textfile2:
    for x, y in izip(textfile1, textfile2):
        x = x.strip()
        y = y.strip()
        print("{0}\t{1}".format(x, y))
This code is taken from the question "Read two textfile line by line simultaneously - python".

Just open the files and use e.g. line = textfile1.readline() to read a line from one of the files.
The line will contain a trailing newline. You know you have reached the end of the file when an empty string is returned.

The following reads the next n lines from file1 and then the next m lines from file2, and can sit inside other code:
def nextlines(number, file):
    n_items = []
    index = number
    while index > 0:
        try:
            n_items.append(next(file).strip())
            index -= 1
        except StopIteration:
            # fewer than `number` lines were left in the file
            break
    return n_items
n = 5
m = 7
l1 = []
l2 = []
with open('file.dat', 'r') as file1:
    with open('file.dat', 'r') as file2:
        # some code
        l1 = nextlines(n, file1)
        l2 = nextlines(m, file2)
        # some other code
# the with blocks close both files, so no explicit close() is needed
print(l1)
print(l2)
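As a rough sketch of the same idea in Python 3, itertools.islice can pull a fixed number of lines from each file per iteration, so only one small chunk of each file is in memory at a time. The helper name read_in_chunks and the in-memory io.StringIO stand-ins are assumptions made for illustration; real code would pass objects returned by open().

```python
import io
from itertools import islice

def read_in_chunks(file_a, file_b, m, n):
    """Yield (lines_a, lines_b): up to m lines from file_a and n from file_b."""
    while True:
        chunk_a = [line.rstrip("\n") for line in islice(file_a, m)]
        chunk_b = [line.rstrip("\n") for line in islice(file_b, n)]
        if not chunk_a and not chunk_b:
            break  # both files are exhausted
        yield chunk_a, chunk_b

# In-memory stand-ins for two large files (hypothetical data)
first = io.StringIO("a1\na2\na3\na4\na5\na6\na7\na8\n")
second = io.StringIO("b1\nb2\n")

# 4 lines from the first file for every 1 line from the second
pairs = list(read_in_chunks(first, second, 4, 1))
for lines_a, lines_b in pairs:
    print(lines_a, lines_b)
```

If one file runs out before the other, the corresponding list is simply shorter or empty, so the caller can decide whether that counts as an error.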

Related

python store a split list into multiple files and items in same line

I have a huge list with 9000 items. I have already looked at related posts, so please don't mark this as a duplicate.
Mylist = [1234,45678,2314,65474,412,87986,21321,4324,68768,1133,712421,12132,898]
I would like to split my list and store each resulting sublist in a text file.
For example, I want each output list to contain ~10% of the items from the original Mylist.
So I tried the following:
for k,g in itertools.groupby(Mylist, lambda x: x/10):
    with open("part1.txt", 'w') as file:
        file.write('\n'.join(yourList))
I expect the output to be multiple text files like the ones below, each containing 10% of the items from the original list:
part1.txt
part2.txt
part3.txt
part4.txt
No need for groupby, a simple loop with slicing is sufficient. You need to decide how to handle the extra items (add to the last list or add an extra file):
Mylist = [1234,45678,2314,65474,412,87986,21321,4324,68768,1133,712421,12132,898]
N = 3 # use 10 in your real life example
step = len(Mylist)//N
start = 0
for i, stop in enumerate(range(step, len(Mylist)+step, step)):
    print(f'file{i}')
    print(Mylist[start:stop]) # save to file here instead
    start = stop
output:
file0
[1234, 45678, 2314, 65474]
file1
[412, 87986, 21321, 4324]
file2
[68768, 1133, 712421, 12132]
file3
[898]
Variant for adding to last file:
Mylist = [1234,45678,2314,65474,412,87986,21321,4324,68768,1133,712421,12132,898]
N = 3
step = len(Mylist)//N
start = 0
for i, stop in enumerate(range(step, len(Mylist), step)):
    print(f'file{i}')
    if i+1 == N:
        stop = len(Mylist)
    print(Mylist[start:stop]) # save to file here instead
    start = stop
output:
file0
[1234, 45678, 2314, 65474]
file1
[412, 87986, 21321, 4324]
file2
[68768, 1133, 712421, 12132, 898]
Saving to file:
Mylist = [1234,45678,2314,65474,412,87986,21321,4324,68768,1133,712421,12132,898]
N = 3
step = len(Mylist)//N
start = 0
for i, stop in enumerate(range(step, len(Mylist), step), start=1):
    if i == N:
        stop = len(Mylist)
    with open(f'file{i}.txt', 'w') as f:
        f.write(','.join(map(str,Mylist[start:stop])))
    start = stop
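A third way to handle the extra items, shown here only as a sketch, is to spread the remainder over the first few parts so that no part differs from another by more than one item. The function name split_evenly is an assumption for illustration and is not part of the answer above.

```python
def split_evenly(items, n_parts):
    """Split items into n_parts consecutive slices whose sizes differ by at most 1."""
    base, extra = divmod(len(items), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        # the first `extra` parts each take one additional item
        stop = start + base + (1 if i < extra else 0)
        parts.append(items[start:stop])
        start = stop
    return parts

Mylist = [1234,45678,2314,65474,412,87986,21321,4324,68768,1133,712421,12132,898]
parts = split_evenly(Mylist, 3)
for i, part in enumerate(parts, start=1):
    print(f'file{i}', part)  # write part to f'file{i}.txt' here instead
```

With 13 items and 3 parts this gives slices of 5, 4 and 4 items, in the original order.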

How to separate different input formats from the same text file with Python

I'm new to programming and Python, and I'm looking for a way to distinguish between two input formats in the same input text file. For example, let's say I have an input file like this, where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line and storing it into one single list, but I'm not sure how to go about to produce 2 lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list and continue for the next i iterations all while storing the subsequent values in a separate list, until I found the next integer and then repeat. However, this would destroy my initial list. Is there a better way to separate the two data formats in different lists?
You could use itertools.islice and a list comprehension:
from itertools import islice
string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
          for parts in [string.split("\n")]
          for idx, line in enumerate(parts)
          if line.isdigit()]
print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
    result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
              for parts in [f.read().split("\n")]
              for idx, line in enumerate(parts)
              if line.isdigit()]
print(result)
You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []
with open("data.txt", "r") as f:
    originalData = list(f)
i = 0
while i < len(originalData): # Iterate through every line
    try:
        n = int(originalData[i]) # See if line can be cast to an integer
        originalData[i] = n # Change string to int in original
        formattedData.append([])
        for j in range(n):
            i += 1
            item = originalData[i].replace('\n', '')
            originalData[i] = item # Remove newline char in original
            formattedData[-1].append(item)
    except ValueError:
        print("File has incorrect format")
    i += 1
print(originalData)
print(formattedData)
The following code will produce a list results which is equal to [Data1, Data2].
The code assumes that each count is exactly the number of entries that follow it. That means it will not work for a file like this:
2
New York,5
Boston,10
Seattle,30
The code:
# get the data from the text file
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()
results = []
index = 0
while index < len(lines):
    # Find the start and end values.
    start = index + 1
    end = start + int(lines[index])
    # Everything from the start up to and excluding the end index gets added
    results.append(lines[start:end])
    # Update the index
    index = end
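If the file is large, the same count-then-records layout can also be read without loading every line first. The sketch below assumes each section starts with an integer count on its own line; the name read_sections and the io.StringIO sample are illustrative only.

```python
import io
from itertools import islice

def read_sections(lines):
    """Read 'count, then count records' sections from an iterable of lines."""
    sections = []
    lines = iter(lines)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # ignore blank lines between sections
        count = int(header)  # raises ValueError if the header is not a number
        # islice consumes exactly `count` lines from the shared iterator
        sections.append([line.rstrip("\n") for line in islice(lines, count)])
    return sections

sample = io.StringIO(
    "5\nWashington,A,10\nNew York,B,20\nSeattle,C,30\nBoston,B,20\nAtlanta,D,50\n"
    "2\nNew York,5\nBoston,10\n"
)
Data1, Data2 = read_sections(sample)
print(Data1)
print(Data2)
```

Because the header loop and islice share one iterator, a record that happens to consist only of digits is not mistaken for a count.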

How to read and delete first n lines from file in Python - Elegant Solution[2]

Originally posted here: How to read and delete first n lines from file in Python - Elegant Solution
I have a pretty big file, ~1 MB in size, and I would like to read the first N lines, save them into a list (newlist) for later use, and then delete them from the file.
My original code was:
import os
n = 3 #the number of line to be read and deleted
with open("bigFile.txt") as f:
    mylist = f.read().splitlines()
newlist = mylist[:n]
os.remove("bigFile.txt")
thefile = open('bigFile.txt', 'w')
del mylist[:n]
for item in mylist:
    thefile.write("%s\n" % item)
Based on Jean-François Fabre code that was posted and later deleted here I am able to run the following code:
import shutil
n = 3
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    for _ in range(n):
        next(f)
    f2.writelines(f)
This works great for deleting the first n lines and "updating" the bigFile.txt but when I try to store the first n values into a list so I can later use them like this:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    mylist = f.read().splitlines()
    newlist = mylist[:n]
    for _ in range(n):
        next(f)
    f2.writelines(f)
I get a StopIteration error.
In your sample code you are reading the entire file to find the first n lines:
# this consumes the entire file
mylist = f.read().splitlines()
This leaves nothing for the subsequent code to read. Instead, simply do:
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    # read the first n lines into newlist
    newlist = [f.readline() for _ in range(n)]
    f2.writelines(f)
I would proceed as follows:
n = 3
yourlist = []
with open("bigFile.txt") as f, open("bigFile2.txt", "w") as f2:
    for i, line in enumerate(f):
        if i < n:
            # keep the first n lines for later use
            yourlist.append(line)
        else:
            # copy every remaining line to the new file
            f2.write(line)
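To end up with the original file name again, one option is to write the remainder to a temporary file and then swap it into place with os.replace. This is only a sketch: the helper name pop_first_lines, the ".tmp" suffix and the demo file contents are assumptions, not part of the answers above.

```python
import os
import tempfile
from itertools import islice

def pop_first_lines(path, n):
    """Return the first n lines of path and rewrite the file without them."""
    tmp_path = path + ".tmp"  # hypothetical temporary file name
    with open(path) as src, open(tmp_path, "w") as dst:
        head = [line.rstrip("\n") for line in islice(src, n)]
        dst.writelines(src)  # streams the remaining lines one by one
    os.replace(tmp_path, path)  # swap the shortened file into place
    return head

# Demo on a small throwaway file
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "bigFile.txt")
with open(demo_path, "w") as f:
    f.write("a\nb\nc\nd\ne\n")

newlist = pop_first_lines(demo_path, 3)
with open(demo_path) as f:
    remaining = f.read()
print(newlist)
print(remaining)
```

Only one line at a time is held in memory while copying, so the approach does not depend on the file size.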

Increase Python code speed

Does anyone have a clue how to increase the speed of this part of my Python code?
It was designed to deal with small files (with just a few lines, for which it is very fast), but I want to run it on big files (~50 GB, with millions of lines).
The main goal of this code is to get strings from a file (.txt), search for them in an input file, and write the number of times each one occurred to the output file.
Here is the code. infile, seqList and out are set by optparse as options at the beginning of the code (not shown):
def novo (infile, seqList, out) :
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
        #Create dictionaries with all the samples
        for i in samples:
            uDic[i.replace(" ","")] = 0
            rDic[i.replace(" ","")] = 0
            nmDic[i.replace(" ","")] = 0
        for k in lines:
            l1 = k.split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">","")
            if len(l1)<2:
                continue
            if l1[4] == "U":
                for k in uDic.keys():
                    if k == l3:
                        uDic[k] += 1
            if l1[4] == "R":
                for j in rDic.keys():
                    if j == l3:
                        rDic[j] += 1
            if l1[4] == "NM":
                for h in nmDic.keys():
                    if h == l3:
                        nmDic[h] += 1
    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int ()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y,u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL$
        except:
            continue
    f.close()
When processing 50 GB files, the question is not how to make the code faster, but how to make it runnable at all.
The main problem is that you will run out of memory. You should modify the code to process the files without holding them entirely in memory, keeping only the line that is currently needed.
The following code from your question reads all the lines from both files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist :
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already
    #Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ","")] = 0
        rDic[i.replace(" ","")] = 0
        nmDic[i.replace(" ","")] = 0
    #similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:
#Create dictionaries with all the samples
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ","")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will get you off the ground.
It would make it much easier if you provided some sample inputs. Because you haven't I haven't tested this, but the idea is simple - iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise inner looping:
def novo(infile, seqList, out):
    from collections import Counter
    import csv
    # Count (type, sample) pairs in a single pass over the input file
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">","")
            if len(l1)<2:
                continue
            counts[(l1[4], l3)] += 1
    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            # a Counter returns 0 for missing keys, so no pre-filled dicts are needed
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            if total == 0:
                continue # skip samples with no hits, as the original try/except does
            f.writerow([sample] + countrow + [total] + [float(c)/total for c in countrow])
It helps to convert your script into functions (this makes profiling easier) and then profile the code to see where the time goes. I suggest using RunSnakeRun (runsnake) to view the results.
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
    uDic[i.replace(" ","")] = 0
Try:
uDic = {i.replace(" ",""):0 for i in samples}
and similarly for the other dicts.
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when you have certain values for l1[4]. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ....
for example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some functions with Cython; use the profiler data to decide what to change.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file. Look into them.
seen = set() # named seen so it does not shadow the built-in set
from collections import Counter
counter = Counter() # Essentially a modified dict, that is optimized for counting...
# like counting occurrences of strings in a text file
4) If you are reading 50 GB of data, you most likely won't be able to hold it all in RAM (I don't know what kind of computer you have), so generators should save you memory and time.
# change list comprehensions to generator expressions
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)
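For point 1, the standard library's cProfile and pstats modules are one way to find where the time goes. The sketch below profiles a small stand-in function; count_matches and its sample data are hypothetical and only illustrate the workflow, not the code from the question.

```python
import cProfile
import io
import pstats

def count_matches(lines, wanted):
    """Count how many lines are members of the wanted set."""
    hits = 0
    for line in lines:
        if line in wanted:  # set membership test, no inner loop over keys
            hits += 1
    return hits

profiler = cProfile.Profile()
profiler.enable()
result = count_matches(["a", "b", "a", "c"] * 1000, {"a"})
profiler.disable()

# Collect the report in a string, sorted by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

For a whole script, running python -m cProfile -s cumulative yourscript.py produces a similar report without changing the code.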

How to delete line from the file in python

I have a file F that contains a huge list of numbers, e.g. F = [1,2,3,4,5,6,7,8,9,...]. I want to loop over the file F and delete all lines that contain any of the numbers in a list, say f = [1,2,4,7,...].
F = open(file)
f = [1,2,4,7,...]
for line in F:
    if line.contains(any number in f):
        delete the line in F
You cannot delete lines from a file in place, so you have to create a new file and write the remaining lines to it. That is what chonw's example does.
It's not clear to me what the form of the file you are trying to modify is. I'm going to assume it looks like this:
1,2,3
4,5,7,19
6,2,57
7,8,9
128
Something like this might work for you:
exclude = set([2, 9]) # named exclude so it does not shadow the built-in filter
lines = open("data.txt").readlines()
outlines = []
for line in lines:
    line_numbers = set(int(num) for num in line.split(","))
    if not exclude & line_numbers:
        outlines.append(line)
if len(outlines) < len(lines):
    open("data.txt", "w").writelines(outlines)
I've never been clear on what the implications of doing an open() as a one-off are, but I use it pretty regularly, and it doesn't seem to cause any problems.
exclude = set((2, 4, 8)) # is faster to find items in a set
with open('numbers.txt') as i, open('filtered.txt', 'w') as out:
    for l in i: # iterates over the lines of a file
        if not any(int(x) in exclude for x in l.split(',')):
            out.write(l)
I'm assuming the file contains only integer numbers separated by ,
Something like this?:
nums = [1, 2]
with open("file", "r") as f:
    source = f.read()
with open("file", "w") as out:
    for line in source.splitlines():
        # note: this is a substring test, so 1 also matches a line containing 128
        if any(str(n) in line for n in nums):
            continue
        out.write(line + "\n")
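For a very large file, the filtering can be done line by line, writing the kept lines to a temporary file and swapping it into place afterwards. This sketch assumes comma-separated integers on each line, as in the sample layout above; the function name drop_lines_containing and the ".tmp" suffix are made up for illustration.

```python
import os
import tempfile

def drop_lines_containing(path, unwanted):
    """Remove every line of path that contains one of the unwanted integers."""
    unwanted = set(unwanted)
    tmp_path = path + ".tmp"  # hypothetical temporary file name
    removed = 0
    with open(path) as src, open(tmp_path, "w") as dst:
        for line in src:
            # compare whole numbers, so 2 does not match 128
            numbers = {int(tok) for tok in line.split(",") if tok.strip()}
            if numbers & unwanted:
                removed += 1
            else:
                dst.write(line)
    os.replace(tmp_path, path)  # swap the filtered file into place
    return removed

# Demo on a small throwaway file
workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "data.txt")
with open(demo_path, "w") as f:
    f.write("1,2,3\n4,5,7,19\n6,2,57\n7,8,9\n128\n")

removed = drop_lines_containing(demo_path, [2, 9])
with open(demo_path) as f:
    kept = f.read().splitlines()
print(removed, kept)
```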
