Python loop through two files, do computation, then output 3 files

I have 2 tab-delimited files, for example:
file1:
12 23 43 34
433 435 76 76
file2:
123 324 53 65
12 457 54 32
I would like to loop through these 2 files, comparing every line of file1 with every line of file2 and vice versa.
If, for example, the 1st number of the 1st line in file1 is the same as the 1st number of the 2nd line in file2,
I would like to put the 1st line from file1 (and the matching line from file2) into a file called output.
Then I would like to put all the lines from file1 that didn't find a match in file2 into a second file,
and all the lines from file2 that didn't find a match in file1 into a third file.
So far I have been able to find the matching lines and put them in a file, but I'm having trouble putting the lines that didn't match into 2 separate files.
one = open(file1, 'r').readlines()
two = open(file2, 'r').readlines()
output = open('output.txt', 'w')
count = 0
list1 = []  # list for lines in file1 that didn't find a match
list2 = []  # list for lines in file2 that didn't find a match
for i in one:
    for j in two:
        columns1 = i.strip().split('\t')
        num1 = int(columns1[0])
        columns2 = j.strip().split('\t')
        num2 = int(columns2[0])
        if num1 == num2:
            count += 1
            output.write(i + j)
        else:
            list1.append(i)
            list2.append(j)
The problem I have here is with the else part.
Can someone show me the right (and better) way to do this? I would greatly appreciate it.
EDIT: Thanks for the quick responses everyone
The 3 outputs I would be looking for are:
Output_file1: #Matching results between the 2 files
12 23 43 34 #line from file1
12 457 54 32 #line from file2
Output_file2: #lines from the first file that didn't find a match
433 435 76 76
Output_file3: #lines from the second file that didn't find a match
123 324 53 65

I would suggest that you use the csv module to read your files, like so (you might have to mess around with the dialect; see http://docs.python.org/library/csv.html for help):
import csv
one = csv.reader(open(file1, 'r'), dialect='excel-tab')
two = csv.reader(open(file2, 'r'), dialect='excel-tab')
Then you might find it easier to "zip" along the lines of both files at the same time, like so (see http://docs.python.org/library/itertools.html#itertools.izip_longest):
import itertools

file_match = open('match', 'w')
file_nomatch1 = open('nomatch1', 'w')
file_nomatch2 = open('nomatch2', 'w')

for i, j in itertools.izip_longest(one, two, fillvalue="-"):
    if i[0] == j[0]:
        file_match.write(str(i) + '\n')
    else:
        file_nomatch1.write(str(i) + '\n')
        file_nomatch2.write(str(j) + '\n')
        # and maybe handle the case where one of them is "-"
I reread the post and realized you are looking for a match between ANY two lines in both files. Maybe someone will find the above code useful, but it won't solve your particular problem.
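For the record, the usual way to match on the first column between any two lines is to index one file with a dict and stream the other past it. A minimal, untested sketch with placeholder file names:

import collections  # only the plain dict below is needed; shown for clarity

# index file2 by its first column: key -> list of lines with that key
index = {}
with open('file2.txt') as f2:
    for line in f2:
        index.setdefault(line.split('\t')[0], []).append(line)

matched_keys = set()
with open('file1.txt') as f1, \
     open('output.txt', 'w') as match, \
     open('nomatch1.txt', 'w') as nomatch1:
    for line in f1:
        key = line.split('\t')[0]
        if key in index:
            match.write(line)           # line from file1
            for other in index[key]:
                match.write(other)      # matching line(s) from file2
            matched_keys.add(key)
        else:
            nomatch1.write(line)

# file2 lines whose key never matched go to the third file
with open('nomatch2.txt', 'w') as nomatch2:
    for key, lines in index.items():
        if key not in matched_keys:
            nomatch2.writelines(lines)

This does one pass over each file instead of the nested loop, which is the same idea the set-based answer below builds on.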

I'd suggest using set operations:
from collections import defaultdict

def parse(filename):
    result = defaultdict(list)
    for line in open(filename):
        # take the first number and use it as the key
        num = int(line.strip().split(' ')[0])
        result[num].append(line)
    return result

def select(selected, items):
    result = []
    for s in selected:
        result.extend(items[s])
    return result

one = parse('one.txt')
two = parse('two.txt')

one_s = set(one)
two_s = set(two)

intersection = one_s & two_s
one_only = one_s - two_s
two_only = two_s - one_s

one_two = defaultdict(list)
for e in one: one_two[e].extend(one[e])
for e in two: one_two[e].extend(two[e])

open('intersection.txt', 'w').writelines(select(intersection, one_two))
open('one_only.txt', 'w').writelines(select(one_only, one))
open('two_only.txt', 'w').writelines(select(two_only, two))
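To see what the three set operations do, here they are on the first-column keys from the question's sample data:

one_s = {12, 433}   # first numbers in file1
two_s = {123, 12}   # first numbers in file2

print(one_s & two_s)  # {12}  -> keys whose lines go to intersection.txt
print(one_s - two_s)  # {433} -> keys only in file1 (one_only.txt)
print(two_s - one_s)  # {123} -> keys only in file2 (two_only.txt)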

I think it is not the best way, but it works for me and looks pretty easy to understand:
# Sorry, but I was not able to check the code below
def get_diff(fileObj1, fileObj2):
    f1Diff = []
    f2Diff = []
    outputData = []
    # x is one row
    f1Data = set(x.strip() for x in fileObj1)
    f2Data = set(x.strip() for x in fileObj2)
    f1Column1 = set(x.split('\t')[0] for x in f1Data)
    f2Column1 = set(x.split('\t')[0] for x in f2Data)
    l1Col1Diff = f1Column1 - f2Column1   # keys only in file1
    l2Col1Diff = f2Column1 - f1Column1   # keys only in file2
    commonPart = f1Column1 & f2Column1
    for line in f1Data.union(f2Data):
        lineKey = line.split('\t')[0]
        if lineKey in commonPart:
            outputData.append(line)
        elif lineKey in l1Col1Diff:
            f1Diff.append(line)
        elif lineKey in l2Col1Diff:
            f2Diff.append(line)
    return outputData, f1Diff, f2Diff

outputData, file1Missed, file2Missed = get_diff(open(file1, 'r'), open(file2, 'r'))

I think that this code fits your purposes:
one = open(file1, 'r').readlines()
two = open(file2, 'r').readlines()
output = open('output.txt', 'w')

first = {x.split('\t')[0] for x in one}
second = {x.split('\t')[0] for x in two}
common = first.intersection(second)

list1 = filter(lambda x: not x.split('\t')[0] in common, one)
list2 = filter(lambda x: not x.split('\t')[0] in common, two)
res1 = filter(lambda x: x.split('\t')[0] in common, one)
res2 = filter(lambda x: x.split('\t')[0] in common, two)

count = len(res1)
for x in range(count):
    output.write(res1[x])
    output.write(res2[x])
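One caveat in case anyone runs this on Python 3: filter() returns an iterator there, so len(res1) and res1[x] would fail. A sketch of the same logic with list comprehensions and zip (it keeps this answer's assumption that both files have the same number of matching lines):

list1 = [x for x in one if x.split('\t')[0] not in common]
list2 = [x for x in two if x.split('\t')[0] not in common]
res1 = [x for x in one if x.split('\t')[0] in common]
res2 = [x for x in two if x.split('\t')[0] in common]

# interleave the matching line pairs into the output file
for a, b in zip(res1, res2):
    output.write(a)
    output.write(b)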

Related

How to separate different input formats from the same text file with Python

I'm new to programming and Python and I'm looking for a way to distinguish between two input formats in the same text file. For example, let's say I have an input file like so, where values are comma-separated:
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
Where the format is N followed by N lines of Data1, and M followed by M lines of Data2. I tried opening the file, reading it line by line and storing it into one single list, but I'm not sure how to go about producing 2 lists for Data1 and Data2, such that I would get:
Data1 = ["Washington,A,10", "New York,B,20", "Seattle,C,30", "Boston,B,20", "Atlanta,D,50"]
Data2 = ["New York,5", "Boston,10"]
My initial idea was to iterate through the list until I found an integer i, remove the integer from the list, and store the subsequent i values in a separate list, repeating when I found the next integer. However, this would destroy my initial list. Is there a better way to separate the two data formats into different lists?
You could use itertools.islice and a list comprehension:
from itertools import islice

string = """
5
Washington,A,10
New York,B,20
Seattle,C,30
Boston,B,20
Atlanta,D,50
2
New York,5
Boston,10
"""

result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
          for parts in [string.split("\n")]
          for idx, line in enumerate(parts)
          if line.isdigit()]

print(result)
This yields
[['Washington,A,10', 'New York,B,20', 'Seattle,C,30', 'Boston,B,20', 'Atlanta,D,50'], ['New York,5', 'Boston,10']]
For a file, you need to change it to:
with open("testfile.txt", "r") as f:
result = [[x for x in islice(parts, idx + 1, idx + 1 + int(line))]
for parts in [f.read().split("\n")]
for idx, line in enumerate(parts)
if line.isdigit()]
print(result)
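If the nested comprehension is hard to follow, here is the equivalent explicit loop (the `for parts in [...]` trick above just binds the split lines to a name):

with open("testfile.txt", "r") as f:
    parts = f.read().split("\n")

result = []
for idx, line in enumerate(parts):
    if line.isdigit():                    # a count line starts a new block
        n = int(line)
        result.append(parts[idx + 1 : idx + 1 + n])
print(result)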
You're definitely on the right track.
If you want to preserve the original list here, you don't actually have to remove integer i; you can just go on to the next item.
Code:
originalData = []
formattedData = []

with open("data.txt", "r") as f:
    f = list(f)
    originalData = f
    i = 0
    while i < len(f):  # Iterate through every line
        try:
            n = int(f[i])  # See if the line can be cast to an integer
            originalData[i] = n  # Change string to int in the original
            formattedData.append([])
            for j in range(n):
                i += 1
                item = f[i].replace('\n', '')
                originalData[i] = item  # Remove the newline char in the original
                formattedData[-1].append(item)
        except ValueError:
            print("File has incorrect format")
        i += 1

print(originalData)
print(formattedData)
The following code will produce a list results which is equal to [Data1, Data2].
The code assumes that the number of entries specified matches the number of lines that actually follow. That means that for a file like this, it will not work:
2
New York,5
Boston,10
Seattle,30
The code:
# get the data from the text file
with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()

results = []
index = 0
while index < len(lines):
    # Find the start and end values.
    start = index + 1
    end = start + int(lines[index])
    # Everything from the start up to and excluding the end index gets added.
    results.append(lines[start:end])
    # Update the index.
    index = end
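For the sample input above, this yields the two lists from the question, which can then be unpacked:

# results == [['Washington,A,10', 'New York,B,20', 'Seattle,C,30',
#              'Boston,B,20', 'Atlanta,D,50'],
#             ['New York,5', 'Boston,10']]
Data1, Data2 = results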

Reading different numbers of lines from two files simultaneously

I'm looking for a way to read from two large files simultaneously without bringing all the data into memory. I want to parse M lines from the first file with N lines from the second file. Is there any wise and memory-efficient solution for it?
So far I know how to read two files at the same time line by line, but I don't know if it would be possible to extend this code to read, for example, 4 lines from the first file and 1 line from the second file.
from itertools import izip

with open("textfile1") as textfile1, open("textfile2") as textfile2:
    for x, y in izip(textfile1, textfile2):
        x = x.strip()
        y = y.strip()
        print("{0}\t{1}".format(x, y))
from here, Read two textfile line by line simultaneously -python
Just open the files and use e.g. line = textfile1.readline() to read a line from one of the files.
line will contain a trailing newline. You can tell that you have reached the end when an empty string is returned.
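A minimal sketch of that readline() idea, assuming placeholder file names and M = 4, N = 1:

def read_block(f, count):
    """Read up to `count` lines from an open file object."""
    block = []
    for _ in range(count):
        line = f.readline()
        if line == "":        # empty string means end of file
            break
        block.append(line.rstrip("\n"))
    return block

with open("textfile1") as f1, open("textfile2") as f2:
    while True:
        block1 = read_block(f1, 4)   # M = 4 lines from the first file
        block2 = read_block(f2, 1)   # N = 1 line from the second file
        if not block1 and not block2:
            break
        # ... process block1 and block2 here ...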
This would read the next n lines from file1, then the next m lines from file2, within some other code:
def nextlines(number, file):
    n_items = []
    index = number
    while index > 0:
        try:
            n_items += [next(file).strip()]
            index -= 1
        except StopIteration:
            break
    return n_items

n = 5
m = 7
l1 = []
l2 = []

with open('file.dat', 'r') as file1:
    with open('file.dat', 'r') as file2:
        #some code
        l1 = nextlines(n, file1)
        l2 = nextlines(m, file2)
        #some other code

print l1
print l2
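For what it's worth, nextlines can also be written with itertools.islice, which pulls at most `number` lines from the open file object without reading ahead. A minimal equivalent sketch:

from itertools import islice

def nextlines(number, file):
    # islice stops early if the file runs out, just like the loop above
    return [line.strip() for line in islice(file, number)]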

Increase Python code speed

Does anyone have a clue how to increase the speed of this part of Python code?
It was designed to deal with small files (with just a few lines, for which it is very fast) but I want to run it on big files (~50 GB, with millions of lines).
The main goal of this code is to get strings from a file (.txt) and search for these in an input file, printing the number of times that these occurred in the output file.
Here is the code (infile, seqList and out are set by optparse as options at the beginning of the code, not shown):
def novo(infile, seqList, out):
    uDic = dict()
    rDic = dict()
    nmDic = dict()
    with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
        samples = [line.strip() for line in RADlist]
        lines = [line.strip() for line in infile]
    #Create dictionaries with all the samples
    for i in samples:
        uDic[i.replace(" ","")] = 0
        rDic[i.replace(" ","")] = 0
        nmDic[i.replace(" ","")] = 0
    for k in lines:
        l1 = k.split("\t")
        l2 = l1[0].split(";")
        l3 = l2[0].replace(">","")
        if len(l1) < 2:
            continue
        if l1[4] == "U":
            for k in uDic.keys():
                if k == l3:
                    uDic[k] += 1
        if l1[4] == "R":
            for j in rDic.keys():
                if j == l3:
                    rDic[j] += 1
        if l1[4] == "NM":
            for h in nmDic.keys():
                if h == l3:
                    nmDic[h] += 1
    f = open(out, "w")
    f.write("Sample"+"\t"+"R"+"\t"+"U"+"\t"+"NM"+"\t"+"TOTAL"+"\t"+"%R"+"\t"+"%U"+"\t"+"%NM"+"\n")
    for i in samples:
        U = int()
        R = int()
        NM = int()
        for k, j in uDic.items():
            if k == i:
                U = j
        for o, p in rDic.items():
            if o == i:
                R = p
        for y, u in nmDic.items():
            if y == i:
                NM = u
        TOTAL = int(U + R + NM)
        try:
            f.write(i+"\t"+str(R)+"\t"+str(U)+"\t"+str(NM)+"\t"+str(TOTAL)+"\t"+str(float(R) / TOTAL)+"\t"+str(float(U) / TOTAL)+"\t"+str(float(NM) / TOTAL)+"\n")
        except:
            continue
    f.close()
When processing 50 GB files, the question is not how to make it faster, but how to make it runnable at all.
The main problem is that you will run out of memory; you should modify the code to process the files without having all of their content in memory at once, keeping only the line that is currently needed.
The following code from your question reads all the lines from both files:
with open(infile, 'r') as infile, open(seqList, 'r') as RADlist:
    samples = [line.strip() for line in RADlist]
    lines = [line.strip() for line in infile]
    # at this moment you are likely to run out of memory already

#Create dictionaries with all the samples
for i in samples:
    uDic[i.replace(" ","")] = 0
    rDic[i.replace(" ","")] = 0
    nmDic[i.replace(" ","")] = 0

#similar loop over `lines` comes later on
You should defer reading the lines until the latest possible moment, like this:
#Create dictionaries with all the samples, one line at a time
with open(seqList, 'r') as RADlist:
    for sampleline in RADlist:
        sample = sampleline.strip().replace(" ", "")
        uDic[sample] = 0
        rDic[sample] = 0
        nmDic[sample] = 0
Note: did you want to use line.strip() or line.split()?
This way, you do not have to keep all the content in memory.
There are many more options for optimization, but this one will let you take off and run.
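For completeness, here is an untested sketch of the same deferral applied to the main loop over `lines`, which also replaces the inner key scans with direct dictionary membership tests:

with open(infile, 'r') as infile:
    for k in infile:
        l1 = k.strip().split("\t")
        if len(l1) < 2:
            continue
        l3 = l1[0].split(";")[0].replace(">", "")
        # look the key up directly instead of scanning all keys
        if l1[4] == "U" and l3 in uDic:
            uDic[l3] += 1
        elif l1[4] == "R" and l3 in rDic:
            rDic[l3] += 1
        elif l1[4] == "NM" and l3 in nmDic:
            nmDic[l3] += 1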
It would make it much easier if you provided some sample inputs. Because you haven't, I haven't tested this, but the idea is simple: iterate through each file only once, using iterators rather than reading the whole file into memory. Use the efficient collections.Counter object to handle the counting and minimise inner looping:
def novo(infile, seqList, out):
    from collections import Counter
    import csv
    # Count
    counts = Counter()
    with open(infile, 'r') as infile:
        for line in infile:
            l1 = line.strip().split("\t")
            l2 = l1[0].split(";")
            l3 = l2[0].replace(">", "")
            if len(l1) < 2:
                continue
            counts[(l1[4], l3)] += 1
    # Produce output
    types = ['R', 'U', 'NM']
    with open(seqList, 'r') as RADlist, open(out, 'w') as outfile:
        f = csv.writer(outfile, delimiter='\t')
        f.writerow(['Sample'] + types + ['TOTAL'] + ['%' + t for t in types])
        for sample in RADlist:
            sample = sample.strip()
            countrow = [counts[(t, sample)] for t in types]
            total = sum(countrow)
            if total:  # skip samples with no hits, as the original try/except did
                f.writerow([sample] + countrow + [total] +
                           [float(c) / total for c in countrow])
If you convert your script into functions (it makes profiling easier) and then profile it, you can see where it spends its time. I suggest using runsnake: runsnakerun
I would try replacing your loops with list & dictionary comprehensions:
For example, instead of
for i in samples:
    uDic[i.replace(" ","")] = 0
Try:
uDic = {i.replace(" ", ""): 0 for i in samples}
and similarly for the other dicts
I don't really follow what's going on in your "for k in lines" loop, but you only need l3 (and l2) when l1[4] has certain values. Why not check for those values before splitting and replacing?
Lastly, instead of looping through all the keys of a dict to see if a given element is in that dict, try:
if x in myDict:
    myDict[x] = ...
for example:
for k in uDic.keys():
    if k == l3:
        uDic[k] += 1
can be replaced with:
if l3 in uDic:
    uDic[l3] += 1
Other than that, try profiling.
1) Look into profilers and adjust the code that is taking the most time.
2) You could try optimizing some methods with Cython; use data from the profiler to target the right code.
3) It looks like you can use a Counter instead of a dict for the output file, and a set for the input file; look into them.
seen = set()  # use a name that doesn't shadow the built-in `set`
from collections import Counter
counter = Counter()  # essentially a modified dict that is optimized for counting...
                     # like counting occurrences of strings in a text file
4) If you are reading a 50 GB file you won't be able to store it all in RAM (I'm assuming who knows what kind of computer you have), so generators should save your memory and time.
# change the list comprehensions to generators
samples = (line.strip() for line in RADlist)
lines = (line.strip() for line in infile)

Iterate through lines changing only one character

I have a file that looks like this
N1 1.023 2.11 3.789
Cl1 3.124 2.4534 1.678
Cl2 # # #
Cl3 # # #
Cl4
Cl5
N2
Cl6
Cl7
Cl8
Cl9
Cl10
N3
Cl11
Cl12
Cl13
Cl14
Cl15
The three numbers continue down throughout.
What I would like to do is pretty much a permutation. These are 3 data sets: set 1 is N1-Cl5, set 2 is N2-Cl10, and set 3 is N3 to the end.
I want every combination of N's and Cl's. For example, the first output would be
Cl1
N1
Cl2
then everything else the same. The next set would be Cl1, Cl2, N1, Cl3... and so on.
I have some code but it won't do what I want, because it wouldn't know that there are three individual data sets. Should I have the three data sets in three different files and then combine them, using code like:
list1 = ['Cl1', 'Cl2', 'Cl3', 'Cl4', 'Cl5']
for line in file1:
    line.replace('N1', list1[0])
    list1.pop(0)
    print >> file.txt, line,
Or is there a better way? Thanks in advance.
This should do the trick:
from itertools import permutations

def print_permutations(in_file):
    separators = ['N1', 'N2', 'N3']
    cur_separator = None
    related_elements = []
    with open(in_file, 'rb') as f:
        for line in f:
            line = line.strip()
            # Split Nx or Clx from the numbers.
            value = line.split()[0]
            # Found a new Nx: print the permutations of the previous group.
            if value in separators:
                if cur_separator and related_elements:
                    for perm in permutations([cur_separator] + related_elements):
                        print perm
                cur_separator = value
                related_elements = []
            else:
                # Found a new Clx. Append it to the list.
                related_elements.append(value)
    # Don't forget to flush the last group at end of file.
    if cur_separator and related_elements:
        for perm in permutations([cur_separator] + related_elements):
            print perm
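If it helps to see what this prints, here is itertools.permutations on a toy group of one N and two Cl labels (kept small because the output grows factorially):

from itertools import permutations

for perm in permutations(['N1', 'Cl1', 'Cl2']):
    print(perm)
# ('N1', 'Cl1', 'Cl2')
# ('N1', 'Cl2', 'Cl1')
# ('Cl1', 'N1', 'Cl2')
# ('Cl1', 'Cl2', 'N1')
# ('Cl2', 'N1', 'Cl1')
# ('Cl2', 'Cl1', 'N1')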
You could use a regex to find the line numbers of the "N" patterns and then slice the file using those line numbers:
import re

n_pat = re.compile(r'N\d')
N_matches = []
with open(sample, 'r') as f:
    for num, line in enumerate(f):
        if re.match(n_pat, line):
            N_matches.append((num, re.match(n_pat, line).group()))

>>> N_matches
[(0, 'N1'), (12, 'N2'), (24, 'N3')]
After you figure out the line numbers where these patterns appear, you can then use itertools.islice to break the file up into a list of lists:
import itertools

first = N_matches[0][0]
final = N_matches[-1][0]
step = N_matches[1][0]

dataset = []
locallist = []
while first < final + step:
    with open(sample, 'r') as f:
        for item in itertools.islice(f, first, first + step):
            if item.strip():
                locallist.append(item.strip())
        dataset.append(locallist)
        locallist = []
    first += step
itertools.islice is a really nice way to take a slice of an iterable. Here's the result of the above on a sample:
>>> dataset
[['N1 1.023 2.11 3.789', 'Cl1 3.126 2.6534 1.878', 'Cl2 3.124 2.4534 1.678', 'Cl3 3.924 2.1134 1.1278', 'Cl4', 'Cl5'], ['N2', 'Cl6 3.126 2.6534 1.878', 'Cl7 3.124 2.4534 1.678', 'Cl8 3.924 2.1134 1.1278', 'Cl9', 'Cl10'], ['N3', 'Cl11', 'Cl12', 'Cl13', 'Cl14', 'Cl15']]
After that, I'm a bit hazy on what you're seeking to do, but I think you want permutations of each sublist of the dataset? If so, you can use itertools.permutations on the sublists of dataset:
for item in itertools.permutations(dataset[0]):
    print(item)
etc.
Final Note:
Assuming I understand correctly what you're doing, the number of permutations is going to be huge. You can calculate how many permutations there are by taking the factorial of the number of items; just 10 items (10!) already produce over 3.6 million permutations.
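As a quick sanity check on that claim, using the standard-library math.factorial:

import math

print(math.factorial(10))  # 3628800
print(math.factorial(15))  # 1307674368000 -- already over a trillion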

find common list between files

I have three text file:
fileA:
13 abc
123 def
234 ghi
1234 jkl
12 mno
fileB:
12 abc
12 def
34 qwe
43 rty
45 mno
fileC:
12 abc
34 sdg
43 yui
54 poi
54 def
I would like to see which values in the 2nd column match between the files. The following code works if the 2nd column is already sorted, but if the 2nd column is not sorted, how do I sort the 2nd column and compare the files?
fileA = open("A.txt",'r')
fileB = open("B.txt",'r')
fileC = open("C.txt",'r')
listA1 = []
for line1 in fileA:
listA = line1.split('\t')
listA1.append(listA)
listB1 = []
for line1 in fileB:
listB = line1.split('\t')
listB1.append(listB)
listC1 = []
for line1 in fileC:
listC = line1.split('\t')
listC1.append(listC)
for key1 in listA1:
for key2 in listB1:
for key3 in listC1:
if key1[1] == key2[1] and key2[1] == key3[1] and key3[1] == key1[1]:
print "Common between three files:",key1[1]
print "Common between file1 and file2 files:"
for key1 in listA1:
for key2 in listB1:
if key1[1] == key2[1]:
print key1[1]
print "Common between file1 and file3 files:"
for key1 in listA1:
for key2 in listC1:
if key1[1] == key2[1]:
print key1[1]
If you just want to sort listA1, listB1, and listC1 by the second column, this is easy:
import operator
listA1.sort(key=operator.itemgetter(1))
If you don't understand itemgetter, this is the same:
listA1.sort(key=lambda element: element[1])
However, I think a better solution is to just use a set:
setA1 = set(element[1] for element in listA1)
setB1 = set(element[1] for element in listB1)
setC1 = set(element[1] for element in listC1)
Or, more simply, don't build the lists in the first place; do this:
setA1 = set()
for line1 in fileA:
    listA = line1.split('\t')
    setA1.add(listA[1])
Either way:
print "Common between file1 and file2 files:"
for key in setA1 & setA2:
print key
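And the triple nested loop from the question collapses to a single three-way intersection (sticking with the Python 2 prints used above):

print "Common between three files:"
for key in setA1 & setB1 & setC1:
    print key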
To simplify it further, you probably want to refactor the repeated stuff into functions first:
def read_file(path):
    with open(path) as f:
        result = set()
        for line in f:
            columns = line.split('\t')
            result.add(columns[1])
        return result

setA1 = read_file('A.txt')
setB1 = read_file('B.txt')
setC1 = read_file('C.txt')
And then you can find further opportunities. For example:
import csv

def read_file(path):
    with open(path) as f:
        return set(row[1] for row in csv.reader(f, delimiter='\t'))
As John Clements points out, you don't even really need all three of them to be sets, just A1, so you could instead do this:
def read_file(path):
    with open(path) as f:
        for row in csv.reader(f, delimiter='\t'):
            yield row[1]

setA1 = set(read_file('A.txt'))
iterB1 = read_file('B.txt')
iterC1 = read_file('C.txt')
The only other change you need is that you have to call intersection instead of using the & operator, so:
for key in setA1.intersection(iterB1):
I'm not sure this last change is actually an improvement. But in Python 3.3, where the only thing you need to do is change the return set(…) into yield from (…), I probably would do it this way. (Even if the files are huge and have tons of duplicates, so there was a performance cost to it, I'd just stick unique_everseen from the itertools recipes around the read_file calls.)
