Reading different text files and extracting same-index lines [closed] - python

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 7 years ago.
I am not a Python programmer, but I need to build an input file for a piece of software. I have two files, a.txt and b.txt, and each line in a.txt corresponds to an "#indexes:" block in b.txt.
a.txt:
0 0 0 L M L 41 ACC sh 1.008732
1 0 0 L * L 53 NCR sh 1.022706
2 1 1 L M L 18 ACC sh 1.025172
3 2 2 L M L 17 ACC sh 1.017734
4 2 2 L * L 21 NCR sh 1.025410
b.txt:
#indexes: 0 0 0
1 -0.375E+04 0.382E+01
2 -0.375E+04 0.432E+01
3 -0.376E+04 0.353E+01
#indexes: 1 0 0
1 -0.635E+04 0.331E+01
2 -0.235E+04 0.238E+01
#indexes: 2 1 1
1 -0.735E+04 0.093E+01
#indexes: 3 2 2
1 -0.835E+04 0.331E+01
2 -0.035E+04 0.438E+01
#indexes: 4 2 2
1 -0.475E+04 0.331E+01
2 -0.365E+04 0.438E+01
I need to extract the lines whose 8th column is "ACC" from a.txt and store them in a new file, a_new.txt.
a_new.txt:
0 0 0 L M L 41 ACC sh 1.008732
2 1 1 L M L 18 ACC sh 1.025172
3 2 2 L M L 17 ACC sh 1.017734
Then read b.txt, find the "indexes" lines, and check whether the numbers on that line match the first three columns of an ACC line; if so, store that index block in b_new.txt:
b_new.txt:
#indexes: 0 0 0
1 -0.375E+04 0.382E+01
2 -0.375E+04 0.432E+01
3 -0.376E+04 0.353E+01
#indexes: 2 1 1
1 -0.735E+04 0.093E+01
#indexes: 3 2 2
1 -0.835E+04 0.331E+01
2 -0.035E+04 0.438E+01
I would appreciate it if anybody could help me.

Took a couple of minutes to do this:
import re

# grab every line of a.txt that contains "ACC"
with open('a.txt') as f:
    a = f.read()
with open('a_new.txt', 'w') as a_new:
    a_new.write('\n'.join(re.findall(r'(^.*ACC.*$)', a, re.M)))

with open('b.txt') as f:
    b = f.read()

with open('b_new.txt', 'w') as b_new, open('a_new.txt') as a_new:
    # the first three columns of each kept line form the index triple
    inds = [x.replace(' ', '') for x in re.findall(r'^\s*(\d\s*\d\s*\d)', a_new.read(), re.M)]
    for ind in inds:
        # match the "#indexes:" header plus everything up to the next header (or end of file)
        reg = r'(#indexes:\s*{0}\s*{1}\s*{2}[\s\S]*?(?=#indexes|$))'.format(*list(ind))
        matches = re.findall(reg, b)
        b_new.write('\n'.join(matches))
After running it, a_new.txt will look like:
0 0 0 L M L 41 ACC sh 1.008732
2 1 1 L M L 18 ACC sh 1.025172
3 2 2 L M L 17 ACC sh 1.017734
and b_new.txt:
#indexes: 0 0 0
1 -0.375E+04 0.382E+01
2 -0.375E+04 0.432E+01
3 -0.376E+04 0.353E+01
#indexes: 2 1 1
1 -0.735E+04 0.093E+01
#indexes: 3 2 2
1 -0.835E+04 0.331E+01
2 -0.035E+04 0.438E+01

How to read a text file using pandas in Python and split each character/letter of the data frame
text_df:
7 3
Tno
h%n
a #
tA
$c
#T%
ii!
And, I want the file to be as below:
7 3
T n o
h % n
a #
t A
$ c
# T %
i i !
Can anyone help me out with this? I tried the code below, but it is not working:
df = pd.read_csv("location\\text_df.txt", sep='', header=None)
Use pd.read_fwf, which reads fixed-width columns:
pd.read_fwf('location\\text_df.txt', widths=[1,1,1], header=None)
0 1 2
0 7 NaN 3
1 T n o
2 h % n
3 a NaN #
4 t A NaN
5 $ c NaN
6 # T %
7 i i !
Or
pd.read_fwf('location\\text_df.txt', widths=[1,1,1], header=None).fillna('')
0 1 2
0 7 3
1 T n o
2 h % n
3 a #
4 t A
5 $ c
6 # T %
7 i i !
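For reference, the same call can be exercised on an in-memory copy of the sample (a sketch; the StringIO stand-in replaces the real file path):

```python
import pandas as pd
from io import StringIO

raw = "7 3\nTno\nh%n\na #\ntA\n$c\n#T%\nii!\n"

# widths=[1, 1, 1] slices each line into three one-character columns
df = pd.read_fwf(StringIO(raw), widths=[1, 1, 1], header=None).fillna('')

# join each row's characters with single spaces, "T n o" style
lines = [' '.join(str(v) for v in row).rstrip() for row in df.itertuples(index=False)]
```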

takes a list of lists of numbers and displays them as strings in a grid

I am trying to define a function that takes a list of lists such as [[0,1,2,3,4,5],[0,1,4,9,16,25],[0,1,8,27,64,125]] and returns a grid of the numbers separated by "\t", like this:
0 1 2 3 4 5
0 1 4 9 16 25
0 1 8 27 64 125
So far all I have is:
def print_table(alist):
    for i in alist:
        print(i)
which just prints each list out nicely... but still in a list.
You could do as follows:
l = [[0,1,2,3,4,5],[0,1,4,9,16,25],[0,1,8,27,64,125]]
print("\n".join("\t".join(map(str, v)) for v in l))
Which results in:
0 1 2 3 4 5
0 1 4 9 16 25
0 1 8 27 64 125
If you want to reuse this code in a function, you can make a simple lambda for it:
as_grid = lambda in_list: "\n".join("\t".join(map(str, v)) for v in in_list)
print(as_grid(l))
lists = [[0,1,2,3,4,5],[0,1,4,9,16,25],[0,1,8,27,64,125]]
y = []
for l in lists:
    l = [str(z) for z in l]
    y.append('\t'.join(l))
print('\n'.join(y))
This prints:
0 1 2 3 4 5
0 1 4 9 16 25
0 1 8 27 64 125
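If tab stops misalign for wide numbers, a width-aware variant (a sketch, not from the original answers) pads each column to its widest entry instead of relying on "\t":

```python
def format_table(rows):
    # compute the width of each column from its widest entry
    widths = [max(len(str(v)) for v in col) for col in zip(*rows)]
    # right-align every value and separate columns with two spaces
    return "\n".join(
        "  ".join(str(v).rjust(w) for v, w in zip(row, widths))
        for row in rows
    )

print(format_table([[0, 1, 2, 3, 4, 5],
                    [0, 1, 4, 9, 16, 25],
                    [0, 1, 8, 27, 64, 125]]))
```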

Same code works differently regarding file size

I am running a simple code to select text from lines in the input file and write that text to an output file.
with open('inputpath', 'r') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        NMEA = line[24:-39]
        iz_datoteka.write(NMEA + '\n')
The data I need to process looks something like this (two lines):
2012-05-01
23:59:59.007;!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;2470028;1;NULL;2012-05-01
21:59:59.007 2012-05-01
23:59:59.007;!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;2470032;1;NULL;2012-05-01
21:59:59.007 ...
Since I have large files to process (~2 GB), I first tested the code on a small part of one of them (I simply copied the first 1000 or so lines and saved them into a test file).
The code worked perfectly and I got the results I was looking for:
!AIVDM,1,1,0,,33cm>k100013vglDPkW1QSin0000,0*6E;
!AIVDM,1,1,0,,19NSBn001nQ8<7vDhIq43C<2280F,0*07;
After that I tried using the code on the whole data and got very different outputs:
2 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 , 0 , , 3 3 c m > k 1 0
0 0 1 3 v g l D P k W 1 Q S i n 0 0 0 0 , 0 * 6 E ; 2 4 7 0 0 2 8 ; 1
; N U L L ; 2 0 1 2 - 3 : 5 9 : 5 9 . 0 0 7 ; ! A I V D M , 1 , 1 ,
0 , , 1 9 N S B n 0 0 1 n Q 8 < 7 v D h I q 4 3 C < 2 2 8 0 F , 0 * 0
7 ; 2 4 7 0 0 3 2 ; 1 ; N U L L ; 2 0 1 2 - ...
I have been trying to figure out the reason for this behaviour, have run out of ideas, and obviously need help.
Thank you, Tobias, for your comment.
Apparently the large data files were in UTF-16-LE, which was the problem. I corrected the Python code to read UTF-16 and write UTF-8, and that did the trick.
import codecs

with codecs.open('inputpath', 'r', encoding='utf-16-le') as vh_datoteka, open('outputpath', 'w') as iz_datoteka:
    for line in vh_datoteka:
        NMEA = line[24:-39]
        iz_line = NMEA + '\n'
        iz_datoteka.write(iz_line.encode('utf-8'))

best way to implement Apriori in python pandas

What is the best way to implement the Apriori algorithm in pandas? So far I am stuck on extracting the patterns using for loops. Everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np

trans = pd.read_table('output.txt', header=None, index_col=0)

def apriori(trans, support=4):
    ts = pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    # user input
    collen, rowlen = ts.shape
    # max length of items
    tssum = ts.sum(axis=1)
    maxlen = tssum.loc[tssum.idxmax()]
    items = list(ts.columns)
    results = []
    # loop through items
    for c in range(1, maxlen):
        # generate patterns
        pattern = []
        for n in len(pattern):
            # calculate support
            pattern = ['supp'] = pattern.sum / rowlen
            # filter by support level
            Condit = pattern['supp'] > support
            pattern = pattern[Condit]
            results.append(pattern)
    return results

results = apriori(trans)
print results
When I feed it this input with support 3:
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4
Assuming I understand what you're after, maybe
from itertools import combinations

def get_support(df):
    pp = []
    # for every combination of columns, count the rows where all items are present
    for cnum in range(1, len(df.columns) + 1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]
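To sanity-check the counting, the same combinations-based loop can be run on a tiny one-hot frame (a self-contained sketch with made-up data, not the asker's transactions):

```python
import pandas as pd
from itertools import combinations

def get_support(df):
    # for every column combination, count rows where all items are present
    pp = []
    for cnum in range(1, len(df.columns) + 1):
        for cols in combinations(df, cnum):
            pp.append([",".join(cols), int(df[list(cols)].all(axis=1).sum())])
    return pd.DataFrame(pp, columns=["Pattern", "Support"])

demo = pd.DataFrame({"a": [1, 1, 0, 1], "b": [1, 0, 1, 1], "c": [0, 1, 1, 1]})
support = get_support(demo).set_index("Pattern")["Support"]
```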
Adding support, confidence, and lift calculations:
def apriori(data, set_length=2):
    import pandas as pd
    from itertools import combinations
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length + 1):
        for cols in combinations(data.columns, combination_number):
            supports = data[list(cols)].all(axis=1).sum() * 1.0 / dataset_size
            confidenceAB = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[0]] == 1])
            confidenceBA = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[-1]] == 1])
            liftAB = confidenceAB * dataset_size / len(data[data[cols[-1]] == 1])
            liftBA = confidenceAB * dataset_size / len(data[data[cols[0]] == 1])
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    # sort_values returns a new frame, so the result must be assigned
    df_supports = df_supports.sort_values(by='Support', ascending=False)
    return df_supports

protein sequence coding

I'm working on a Python program to compute a numerical coding of the mutated residues and positions in a set of protein sequences, stored in a FASTA-style file with the sequences separated by commas. I'm trying to find the positions and residues that are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another set of sequences) explains the algorithm. The first box shows the alignment of the input sequences; the last box is the output file. How can I do this with my file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
positions           1 2 3 4 5 6     2 3 5
protein sequence1   M T A Q D D     T A D
protein sequence2   M T A Q D D     T A D
protein sequence3   M T S Q E D     T S E
protein sequence4   M T A Q D D     T A D
protein sequence5   M K A Q H D     K A H
PROTEIN SEQUENCE ALIGNMENT          DISCARD NON-VARIABLE REGIONS
EACH MUTATED RESIDUE IS THEN SPLIT INTO A SEPARATE COLUMN (positions 2 2 3 3 5 5 5)
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))
print '\t'.join(str(x) + y for x, y in alle)
for p in pos:
    print '\t'.join('1' if key in p else '0' for key in alle)
(Here I'm getting columns for the non-mutated as well as the mutated residues, but I want only the columns for mutated residues.)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print df.select(lambda x: not df[x].all(), axis = 1)
(Here it gives output, but not in order, i.e. first 2K, then 2T, then 3A, and so on.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.ix[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.ix[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows, however this may be done more efficiently depending on your situation:
df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))
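A compact variant of the same idea (a sketch, not the original answer) uses nunique to find the variable positions and 1-based labels as in the desired output, with astype(int) to keep 0/1 values across pandas versions:

```python
import pandas as pd

seqs = 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')
df = pd.DataFrame([list(s) for s in seqs])

# keep only the positions where more than one residue occurs
variable = [c for c in df.columns if df[c].nunique() > 1]

# one indicator column per (position, residue) pair, labelled "2T", "5H", ...
coded = pd.concat(
    [pd.get_dummies(df[c], prefix=str(c + 1), prefix_sep='') for c in variable],
    axis=1,
).astype(int)
```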
