I am learning Python.
For the code below, how can I convert the for loops to while loops in an efficient way?
import pandas as pd
transactions01 = []
file=open('raw-data1.txt','w')
file.write('HotDogs,Buns\nHotDogs,Buns\nHotDogs,Coke,Chips\nChips,Coke\nChips,Ketchup\nHotDogs,Coke,Chips\n')
file.close()
file=open('raw-data1.txt','r')
lines = file.readlines()
for line in lines:
    items = line[:-1].split(',')
    has_item = {}
    for item in items:
        has_item[item] = 1
    transactions01.append(has_item)
file.close()
data = pd.DataFrame(transactions01)
data.fillna(0, inplace = True)
data
Here is my attempt with while loops:
i = 0
while i < len(lines):
    items = lines[i][:-1].split(',')
    has_item = {}
    j = 0
    while j < len(items):
        has_item[items[j]] = 1
        j += 1
    transactions01.append(has_item)
    i += 1
It looks like you could just use the csv module to parse the file (since you have an inconsistent number of items per row), turn it into a dataframe, use pd.get_dummies to get 0/1's per item present, then aggregate back to a row level to produce your final output, eg:
import pandas as pd
import csv
with open('raw-data1.txt') as fin:
    df = pd.get_dummies(pd.DataFrame(csv.reader(fin)).stack()).groupby(level=0).max()
Will give you a df of:
Buns Chips Coke HotDogs Ketchup
0 1 0 0 1 0
1 1 0 0 1 0
2 0 1 1 1 0
3 0 1 1 0 0
4 0 1 0 0 1
5 0 1 1 1 0
...which you can then write back out as CSV if required.
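A minimal sketch of that write-out step (the output filename is my own choice):

# write the 0/1 transaction table back out as CSV
df.to_csv('transactions01.csv', index=False)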
I asked a question like this before, but that was a simpler one and it has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
But here I have a more advanced version of that question:
I have a sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge some strings if they have substrings in common. So, in this example, the strings 'b,c', 'a', and 'a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
So I can have three groups, and there are no common substrings between any two groups.
Now I am trying to build up a similarity data frame, in which 1 means two strings have substrings in common. Here is my code:
import numpy as np
import pandas as pd

commonWords = 1
for i in np.arange(a.shape[0]):
    a.loc[:, a.loc[i, 'ACTIVITY']] = 0
for i in a.loc[:, 'ACTIVITY']:
    il = i.split(',')
    for j in a.loc[:, 'ACTIVITY']:
        jl = j.split(',')
        c = [x in il for x in jl]
        c1 = [x for x in c if x == True]
        a.loc[(a.loc[:, 'ACTIVITY'] == i), j] = 1 if len(c1) >= commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords means how many substrings I want two strings to have in common. For example, if commonWords=2, then two strings will be merged together only if they have two or more substrings in common. When commonWords=2, the groups should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
Use:
import numpy as np
import pandas as pd
from itertools import combinations, chain
from collections import Counter

a = pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords = 2
#create edges (each can only connect two nodes)
L2_nested = [list(combinations(l, commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
#keep the edges shared by enough rows, as sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
#create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print(a)
#forward-fill values of the new columns and use factorize for groups
new = pd.factorize(a.assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print(a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
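Since the grouping is really a connected-components problem, here is an alternative minimal sketch using union-find (my own addition, assuming commonWords=1); it reproduces the first expected grouping:

import pandas as pd

a = pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
tokens = [set(s.split(',')) for s in a['ACTIVITY']]
parent = list(range(len(tokens)))

def find(x):
    # follow parent pointers to the root, compressing the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# link every pair of rows sharing at least commonWords=1 substrings
for i in range(len(tokens)):
    for j in range(i + 1, len(tokens)):
        if len(tokens[i] & tokens[j]) >= 1:
            parent[find(i)] = find(j)

a['group'] = pd.factorize([find(i) for i in range(len(tokens))])[0]
print(a)   # groups: 0, 0, 0, 1, 2, 2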
I have a text file which looks like this:
~Date and Time of Data Converting: 15.02.2019 16:12:44
~Name of Test: XXX
~Address: ZZZ
~ID: OPP
~Testchannel: CH06
~a;b;DateTime;c;d;e;f;g;h;i;j;k;extract;l;m;n;o;p;q;r
0;1;04.03.2019 07:54:19;0;0;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;3;0;0;0;0;0
5,5523894132E-7;2;04.03.2019 07:54:19;5,5523894132E-7;5,5523894132E-7;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;0;0;0;0;0;0
0,00277777777779538;3;04.03.2019 07:54:29;0,00277777777779538;0,00277777777779538;2;Pause;3,5724446855812;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,00555555532278617;4;04.03.2019 07:54:39;0,00555555532278617;0,00555555532278617;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;1;0;0;0;0;0
0,00833333333338613;5;04.03.2019 07:54:49;0,00833333333338613;0,00833333333338613;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,0111112040002119;6;04.03.2019 07:54:59;0,0111112040002119;0,0111112040002119;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,013888887724954;7;04.03.2019 07:55:09;0,013888887724954;0,013888887724954;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
I need to extract the values from the column named extract and store the output as an Excel file.
Can anyone give me an idea of how to proceed?
So far, I have only been able to create an empty output file and to read the text file; I don't know how to write the extracted values into the output file.
import os
import numpy as np

file = open('extract.csv', "a")
if os.path.getsize('extract.csv') == 0:
    file.write(" " + ";" + "Datum" + ";" + "extract" + ";")
with open('myfile.txt') as f:
    dat = [f.readline() for x in range(10)]
datum = dat[7].split(' ')[3]
data = np.genfromtxt('myfile.txt', delimiter=';', skip_header=12, dtype=str)
You can use the pandas module.
First, you need to skip the header lines of your text file. Here, I assume the number of such lines is not known in advance, so I loop until I find a data row.
Then read the data.
Finally, export it as a dataframe with to_excel (doc).
Here is the code:
# Import module
import pandas as pd
# Read file
with open('temp.txt') as f:
    content = f.read().split("\n")
# Skip the first lines (find where the data starts)
for i, line in enumerate(content):
    if line and line[0] != '~': break
# Column names and data
header = content[i - 1][1:].split(';')
data = [row.split(';') for row in content[i:] if row]  # skip blank trailing lines
# Store in dataframe
df = pd.DataFrame(data, columns=header)
print(df)
# a b DateTime c d e f ... l m n o p q r
# 0 0 1 04.03.2019 07:54:19 0 0 2 Pause ... 1 3 0 0 0 0 0
# 1 5,5523894132E-7 2 04.03.2019 07:54:19 5,5523894132E-7 5,5523894132E-7 2 Pause ... 1 0 0 0 0 0 0
# 2 0,00277777777779538 3 04.03.2019 07:54:29 0,00277777777779538 0,00277777777779538 2 Pause ... 1 1 0 0 0 0 0
# 3 0,00555555532278617 4 04.03.2019 07:54:39 0,00555555532278617 0,00555555532278617 2 Pause ... 1 1 0 0 0 0 0
# 4 0,00833333333338613 5 04.03.2019 07:54:49 0,00833333333338613 0,00833333333338613 2 Pause ... 1 1 0 0 0 0 0
# 5 0,0111112040002119 6 04.03.2019 07:54:59 0,0111112040002119 0,0111112040002119 2 Pause ... 1 1 0 0 0 0 0
# 6 0,013888887724954 7 04.03.2019 07:55:09 0,013888887724954 0,013888887724954 2 Pause ... 1 1 0 0 0 0 0
# Select only the extract column
# df = df['extract']
# Save the data in excel file
df.to_excel("OutPut.xlsx", "MySheetName", index=False)
Note: if you know the number of lines to skip, you can simply load the dataframe with read_csv using the skiprows parameter (doc).
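A minimal sketch of that shortcut, assuming the sample layout above (five '~' metadata lines before the header row) and the comma decimal separators shown:

import pandas as pd

df = pd.read_csv('temp.txt', sep=';', skiprows=5, decimal=',')
df.columns = [c.lstrip('~') for c in df.columns]  # the header row starts with '~a'
df['extract'].to_excel('OutPut.xlsx', 'MySheetName', index=False)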
Hope that helps!
I need to convert a genotype dosage file into an allelic dosage file.
Input looks like this:
#snp a1 a2 i1 j1 i2 j2 i3 j3
chr6_24000211_D D I3 0 0 0 0 0 0
rs78244999 A G 1 0 1 0 1 0
rs1511479 T C 0 1 1 0 0 1
rs34425199 A C 0 0 0 0 0 0
rs181892770 A G 1 0 1 0 1 0
rs501871 A G 0 1 0.997 0.003 0 1
chr6_24000836_D D I4 0 0 0 0 0 0
chr6_24000891_I I2 D 0 0 0 0 0 1
rs16888446 A C 0 0 0 0 0 0
Columns 1-3 are identifiers. No operations should be performed on these; they just need to be copied as-is into the output file. The remaining columns need to be treated as (i, j) pairs, and for each pair the following operation needs to be performed: 2*i + j.
Pseudocode
write first three columns of input file to output
for all i and j in the file, write 2*i + j to output
Desired output looks like this:
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
I will be performing this on a number of files with different total columns, so I want the loop to run for (total number of columns - 3)/2 iterations, i.e. until it reaches the last column of the file.
Input files are ~9 million rows by ~10,000 columns, so reading them into a program such as R is very slow. I am not sure which tool is most efficient for this (awk? perl? python?), and as a novice coder I am unsure where to begin with the syntax for a solution.
Here's the awk implementation of your posted algorithm, enhanced just slightly to produce the first row you show in your expected output:
$ cat tst.awk
{
    printf "%s %s %s", $1, $2, $3
    c = 0
    for (i=4; i<NF; i+=2) {
        printf " %s", (NR>1 ? 2*$i + $(i+1) : ++c)
    }
    print ""
}
$ awk -f tst.awk file
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
Python version
#!/usr/bin/env python3
from itertools import zip_longest

def chunk(sequence, chunk_size=2):
    """
    list(chunk([1,2,3,4], 2)) => [(1,2),(3,4)]
    """
    # Take advantage of the same iterator being consumed
    # multiple times to do the grouping
    return zip_longest(*[iter(sequence)] * chunk_size)

def processor(csv_reader):
    for row in csv_reader:
        if row[0].startswith('#'):
            # pass the header through, renumbering the pair columns 1..n
            yield row[:3] + list(range(1, len(row[3:]) // 2 + 1))
            continue
        # collect the pairs and process them
        processed_pairs = (2*float(i)+float(j) for i, j in chunk(row[3:]))
        # yield back the first 3 elements and the processed pairs
        yield list(i for j in (row[:3], processed_pairs) for i in j)

if __name__ == '__main__':
    import csv, sys
    with open(sys.argv[1], newline='') as csvfile:
        source = processor(csv.reader(csvfile, delimiter=' '))
        for line in source:
            # format floats compactly so 2.0 prints as 2
            print(' '.join(f'{x:g}' if isinstance(x, float) else str(x) for x in line))
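A usage sketch for the Python version, assuming the script above is saved as dosage.py (a name of my choosing):

$ python dosage.py file > output.txt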
This Perl program will do as you ask. It expects the input file as a parameter on the command line and will send the output to STDOUT, which you can redirect to a file if you wish.
use strict;
use warnings;

while (<>) {
    my @fields = split;
    my @probs  = splice @fields, 3;
    if (/^#/) {
        push @fields, 1 .. @probs / 2;
    }
    else {
        while (@probs >= 2) {
            my ($i, $j) = splice @probs, 0, 2;
            push @fields, $i + $i + $j;
        }
    }
    print "@fields\n";
}
output
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
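For comparison, here is a vectorized pandas/numpy sketch (my own addition, not from the answers above); it assumes the whole file fits in memory, which at ~9 million rows by ~10,000 columns it may well not, so the streaming awk/perl approaches above are likely more practical at full scale:

import pandas as pd

df = pd.read_csv('file', sep=' ')
ids = df.iloc[:, :3]                          # identifier columns, copied as-is
pairs = df.iloc[:, 3:].to_numpy(dtype=float)  # the i/j probability columns
dosage = 2 * pairs[:, 0::2] + pairs[:, 1::2]  # 2*i + j for each (i, j) pair
out = ids.join(pd.DataFrame(dosage, columns=range(1, dosage.shape[1] + 1)))
out.to_csv('output.txt', sep=' ', index=False, float_format='%g')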
I'm working on a Python program to compute a numerical coding of mutated residues and positions for a set of strings (protein sequences), stored in a FASTA-format file with the protein sequences separated by commas. I'm trying to find the positions and sequences that are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another FASTA file) explains the algorithm. In this figure the first box represents the alignment of the input sequences, and the last box represents the output file. How can I do this with my FASTA file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
PROTEIN SEQUENCE ALIGNMENT               DISCARD NON-VARIABLE REGION
positions          1 2 3 4 5 6           positions          2 3 5
protein sequence1  M T A Q D D           protein sequence1  T A D
protein sequence2  M T A Q D D           protein sequence2  T A D
protein sequence3  M T S Q E D           protein sequence3  T S E
protein sequence4  M T A Q D D           protein sequence4  T A D
protein sequence5  M K A Q H D           protein sequence5  K A H

MUTATED RESIDUE IS SPLIT INTO A SEPARATE COLUMN
positions          2 2 3 3 5 5 5
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
(here I'm getting columns for both mutated and non-mutated residues, but I want only the columns for mutated residues)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val: (df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print(df.select(lambda x: not df[x].all(), axis=1))
(here it does give output, but the columns are not in order, i.e. first 2K, then 2T, then 3A, and so on.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.ix[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.ix[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows; however, this may be done more efficiently depending on your situation:
df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))
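For reference, a compact restatement with current pandas (my own sketch, since .ix has been removed in recent versions); it also uses the 1-based position labels the question asks for:

import pandas as pd

df = pd.DataFrame([list(s) for s in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
variable = df.columns[df.nunique() > 1]                      # keep only mutated positions
renamed = df[variable].rename(columns=lambda c: str(c + 1))  # 0-based -> 1-based labels
print(pd.get_dummies(renamed, prefix_sep='').astype(int))    # columns like 2K, 2T, 3A, ...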