I am learning Python.
For the code below, how can I convert the for loops to while loops in an efficient way?
import pandas as pd
transactions01 = []
file=open('raw-data1.txt','w')
file.write('HotDogs,Buns\nHotDogs,Buns\nHotDogs,Coke,Chips\nChips,Coke\nChips,Ketchup\nHotDogs,Coke,Chips\n')
file.close()
file=open('raw-data1.txt','r')
lines = file.readlines()
for line in lines:
    items = line[:-1].split(',')
    has_item = {}
    for item in items:
        has_item[item] = 1
    transactions01.append(has_item)
file.close()
data = pd.DataFrame(transactions01)
data.fillna(0, inplace = True)
data
Here is my attempt with while loops:
i = 0
while i < len(lines):
    items = lines[i][:-1].split(',')
    has_item = {}
    j = 0
    while j < len(items):
        has_item[items[j]] = 1
        j += 1
    transactions01.append(has_item)
    i += 1
It looks like you could just use the csv module to parse the file (since you have an inconsistent number of items per row), turn it into a dataframe, use pd.get_dummies to get 0/1's per item present, then aggregate back to a row level to produce your final output, eg:
import pandas as pd
import csv
with open('raw-data1.txt') as fin:
    df = pd.get_dummies(pd.DataFrame(csv.reader(fin)).stack()).groupby(level=0).max()
Will give you a df of:
Buns Chips Coke HotDogs Ketchup
0 1 0 0 1 0
1 1 0 0 1 0
2 0 1 1 1 0
3 0 1 1 0 0
4 0 1 0 0 1
5 0 1 1 1 0
...which you can then write back out as CSV if required.
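A minimal sketch of that write-out step (the output filename is my own choice):

# write the 0/1 transaction table back out as CSV
df.to_csv('transactions01.csv', index=False)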
I asked a question like this before, but that was a simpler one and it has been resolved: how to merge strings that have substrings in common to produce some groups in a data frame in Python.
But here I have a more advanced version of that question:
I have a sample data:
a=pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
What I want to do is merge some strings if they have substrings in common. So, in this example, the strings 'b,c', 'a', and 'a,c,d,e' should be merged together because they can be linked to each other. 'j,k,l' and 'k,l,m' should be in one group. In the end, I hope I can have something like:
group
'b,c', 0
'a', 0
'a,c,d,e', 0
'f,g,h,i', 1
'j,k,l', 2
'k,l,m' 2
So I can have three groups, and there are no common substrings between any two groups.
Now I am trying to build up a similarity data frame, in which 1 means two strings have substrings in common. Here is my code:
import numpy as np
import pandas as pd

commonWords = 1
for i in np.arange(a.shape[0]):
    a.loc[:, a.loc[i, 'ACTIVITY']] = 0
for i in a.loc[:, 'ACTIVITY']:
    il = i.split(',')
    for j in a.loc[:, 'ACTIVITY']:
        jl = j.split(',')
        c = [x in il for x in jl]
        c1 = [x for x in c if x == True]
        a.loc[(a.loc[:, 'ACTIVITY'] == i), j] = 1 if len(c1) >= commonWords else 0
a
The result is:
ACTIVITY b,c a a,c,d,e f,g,h,i j,k,l k,l,m
0 b,c 1 0 1 0 0 0
1 a 0 1 1 0 0 0
2 a,c,d,e 1 1 1 0 0 0
3 f,g,h,i 0 0 0 1 0 0
4 j,k,l 0 0 0 0 1 1
5 k,l,m 0 0 0 0 1 1
In this code, commonWords means how many substrings I want two strings to have in common. For example, if commonWords=2, then two strings will be merged together only if they have two or more substrings in common. When commonWords=2, the groups should be:
group
'b,c', 0
'a', 1
'a,c,d,e', 2
'f,g,h,i', 3
'j,k,l', 4
'k,l,m' 4
Use:
import numpy as np
import pandas as pd
from itertools import combinations, chain
from collections import Counter

a = pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})

#split values by , to lists
splitted = a['ACTIVITY'].str.split(',')
commonWords = 2
#create edges (each can only connect two nodes)
L2_nested = [list(combinations(l, commonWords)) for l in splitted]
L2 = list(chain.from_iterable(L2_nested))
#keep the edges shared by enough rows, as sets
f1 = [set(k) for k, v in Counter(L2).items() if v >= commonWords]
f2 = [set(x) for x in splitted]
#create new columns for matched sets
for val in f1:
    j = ','.join(val)
    a[j] = [j if len(val & x) == commonWords else np.nan for x in f2]
print(a)
#forward-fill values of the new columns and use factorize for groups
new = pd.factorize(a.assign(ACTIVITY = a.index).ffill(axis=1).iloc[:, -1])[0]
a = a[['ACTIVITY']].assign(group = new)
print(a)
ACTIVITY group
0 b,c 0
1 a 1
2 a,c,d,e 2
3 f,g,h,i 3
4 j,k,l 4
5 k,l,m 4
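Since the grouping is really a connected-components problem, here is an alternative minimal sketch using union-find (my own addition, assuming commonWords=1); it reproduces the first expected grouping:

import pandas as pd

a = pd.DataFrame({'ACTIVITY':['b,c','a','a,c,d,e','f,g,h,i','j,k,l','k,l,m']})
tokens = [set(s.split(',')) for s in a['ACTIVITY']]
parent = list(range(len(tokens)))

def find(x):
    # follow parent pointers to the root, compressing the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# link every pair of rows sharing at least commonWords=1 substrings
for i in range(len(tokens)):
    for j in range(i + 1, len(tokens)):
        if len(tokens[i] & tokens[j]) >= 1:
            parent[find(i)] = find(j)

a['group'] = pd.factorize([find(i) for i in range(len(tokens))])[0]
print(a)   # groups: 0, 0, 0, 1, 2, 2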
I have a text file which looks like this:
~Date and Time of Data Converting: 15.02.2019 16:12:44
~Name of Test: XXX
~Address: ZZZ
~ID: OPP
~Testchannel: CH06
~a;b;DateTime;c;d;e;f;g;h;i;j;k;extract;l;m;n;o;p;q;r
0;1;04.03.2019 07:54:19;0;0;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;3;0;0;0;0;0
5,5523894132E-7;2;04.03.2019 07:54:19;5,5523894132E-7;5,5523894132E-7;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;0;0;0;0;0;0
0,00277777777779538;3;04.03.2019 07:54:29;0,00277777777779538;0,00277777777779538;2;Pause;3,5724446855812;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,00555555532278617;4;04.03.2019 07:54:39;0,00555555532278617;0,00555555532278617;2;Pause;3,57263521596443;0;0;0;0;24,55957;1;1;0;0;0;0;0
0,00833333333338613;5;04.03.2019 07:54:49;0,00833333333338613;0,00833333333338613;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,0111112040002119;6;04.03.2019 07:54:59;0,0111112040002119;0,0111112040002119;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
0,013888887724954;7;04.03.2019 07:55:09;0,013888887724954;0,013888887724954;2;Pause;3,57263521596443;0;0;0;0;24,55653;1;1;0;0;0;0;0
I need to extract the values from the column named extract and store the output as an Excel file.
Can anyone give me an idea of how to proceed?
So far, I have only been able to create an empty output file and to read the text file; I don't know how to write the extracted values into the output file.
import os
import numpy as np

file = open('extract.csv', "a")
if os.path.getsize('extract.csv') == 0:
    file.write(" " + ";" + "Datum" + ";" + "extract" + ";")
with open('myfile.txt') as f:
    dat = [f.readline() for x in range(10)]
datum = dat[7].split(' ')[3]
data = np.genfromtxt('myfile.txt', delimiter=';', skip_header=12, dtype=str)
You can use the pandas module.
First, you need to skip the header lines of your text file. Here, I assume the number of such lines is not known in advance, so I loop until I find a data row.
Then read the data.
Finally, export it as a dataframe with to_excel (doc).
Here is the code:
# Import module
import pandas as pd
# Read file
with open('temp.txt') as f:
    content = f.read().split("\n")
# Skip the first lines (find where the data starts)
for i, line in enumerate(content):
    if line and line[0] != '~': break
# Column names and data
header = content[i - 1][1:].split(';')
data = [row.split(';') for row in content[i:] if row]  # skip blank trailing lines
# Store in dataframe
df = pd.DataFrame(data, columns=header)
print(df)
# a b DateTime c d e f ... l m n o p q r
# 0 0 1 04.03.2019 07:54:19 0 0 2 Pause ... 1 3 0 0 0 0 0
# 1 5,5523894132E-7 2 04.03.2019 07:54:19 5,5523894132E-7 5,5523894132E-7 2 Pause ... 1 0 0 0 0 0 0
# 2 0,00277777777779538 3 04.03.2019 07:54:29 0,00277777777779538 0,00277777777779538 2 Pause ... 1 1 0 0 0 0 0
# 3 0,00555555532278617 4 04.03.2019 07:54:39 0,00555555532278617 0,00555555532278617 2 Pause ... 1 1 0 0 0 0 0
# 4 0,00833333333338613 5 04.03.2019 07:54:49 0,00833333333338613 0,00833333333338613 2 Pause ... 1 1 0 0 0 0 0
# 5 0,0111112040002119 6 04.03.2019 07:54:59 0,0111112040002119 0,0111112040002119 2 Pause ... 1 1 0 0 0 0 0
# 6 0,013888887724954 7 04.03.2019 07:55:09 0,013888887724954 0,013888887724954 2 Pause ... 1 1 0 0 0 0 0
# Select only the extract column
# df = df['extract']
# Save the data in excel file
df.to_excel("OutPut.xlsx", "MySheetName", index=False)
Note: if you know the number of lines to skip, you can simply load the dataframe with read_csv using the skiprows parameter (doc).
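A minimal sketch of that shortcut, assuming the sample layout above (five '~' metadata lines before the header row) and the comma decimal separators shown:

import pandas as pd

df = pd.read_csv('temp.txt', sep=';', skiprows=5, decimal=',')
df.columns = [c.lstrip('~') for c in df.columns]  # the header row starts with '~a'
df['extract'].to_excel('OutPut.xlsx', 'MySheetName', index=False)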
Hope that helps!
I need to convert a genotype dosage file into an allelic dosage file.
Input looks like this:
#snp a1 a2 i1 j1 i2 j2 i3 j3
chr6_24000211_D D I3 0 0 0 0 0 0
rs78244999 A G 1 0 1 0 1 0
rs1511479 T C 0 1 1 0 0 1
rs34425199 A C 0 0 0 0 0 0
rs181892770 A G 1 0 1 0 1 0
rs501871 A G 0 1 0.997 0.003 0 1
chr6_24000836_D D I4 0 0 0 0 0 0
chr6_24000891_I I2 D 0 0 0 0 0 1
rs16888446 A C 0 0 0 0 0 0
Columns 1-3 are identifiers. No operations should be performed on these; they just need to be copied as-is into the output file. The remaining columns need to be treated as (i, j) pairs, and for each pair the following operation needs to be performed: 2*i + j.
Pseudocode
write first three columns of input file to output
for all i and j in the file, write 2*i + j to output
Desired output looks like this:
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
I will be performing this on a number of files with different total columns, so I want the loop to run for (total number of columns - 3)/2 iterations, i.e. until it reaches the last column of the file.
Input files are ~9 million rows by ~10,000 columns, so reading them into a program such as R is very slow. I am not sure which tool is most efficient for this (awk? perl? python?), and as a novice coder I am unsure where to begin with the syntax for a solution.
Here's the awk implementation of your posted algorithm, enhanced just slightly to produce the first row you show in your expected output:
$ cat tst.awk
{
    printf "%s %s %s", $1, $2, $3
    c = 0
    for (i=4; i<NF; i+=2) {
        printf " %s", (NR>1 ? 2*$i + $(i+1) : ++c)
    }
    print ""
}
$ awk -f tst.awk file
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
Python version
#!/usr/bin/env python3
from itertools import zip_longest

def chunk(sequence, chunk_size=2):
    """
    list(chunk([1,2,3,4], 2)) => [(1,2),(3,4)]
    """
    # Take advantage of the same iterator being consumed
    # multiple times to do the grouping
    return zip_longest(*[iter(sequence)] * chunk_size)

def processor(csv_reader):
    for row in csv_reader:
        if row[0].startswith('#'):
            # pass the header through, renumbering the pair columns 1..n
            yield row[:3] + list(range(1, len(row[3:]) // 2 + 1))
            continue
        # collect the pairs and process them
        processed_pairs = (2*float(i)+float(j) for i, j in chunk(row[3:]))
        # yield back the first 3 elements and the processed pairs
        yield list(i for j in (row[:3], processed_pairs) for i in j)

if __name__ == '__main__':
    import csv, sys
    with open(sys.argv[1], newline='') as csvfile:
        source = processor(csv.reader(csvfile, delimiter=' '))
        for line in source:
            # format floats compactly so 2.0 prints as 2
            print(' '.join(f'{x:g}' if isinstance(x, float) else str(x) for x in line))
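A usage sketch for the Python version, assuming the script above is saved as dosage.py (a name of my choosing):

$ python dosage.py file > output.txt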
This Perl program will do as you ask. It expects the input file as a parameter on the command line and will send the output to STDOUT, which you can redirect to a file if you wish.
use strict;
use warnings;

while (<>) {
    my @fields = split;
    my @probs  = splice @fields, 3;
    if (/^#/) {
        push @fields, 1 .. @probs / 2;
    }
    else {
        while (@probs >= 2) {
            my ($i, $j) = splice @probs, 0, 2;
            push @fields, $i + $i + $j;
        }
    }
    print "@fields\n";
}
output
#snp a1 a2 1 2 3
chr6_24000211_D D I3 0 0 0
rs78244999 A G 2 2 2
rs1511479 T C 1 2 1
rs34425199 A C 0 0 0
rs181892770 A G 2 2 2
rs501871 A G 1 1.997 1
chr6_24000836_D D I4 0 0 0
chr6_24000891_I I2 D 0 0 1
rs16888446 A C 0 0 0
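For comparison, here is a vectorized pandas/numpy sketch (my own addition, not from the answers above); it assumes the whole file fits in memory, which at ~9 million rows by ~10,000 columns it may well not, so the streaming awk/perl approaches above are likely more practical at full scale:

import pandas as pd

df = pd.read_csv('file', sep=' ')
ids = df.iloc[:, :3]                          # identifier columns, copied as-is
pairs = df.iloc[:, 3:].to_numpy(dtype=float)  # the i/j probability columns
dosage = 2 * pairs[:, 0::2] + pairs[:, 1::2]  # 2*i + j for each (i, j) pair
out = ids.join(pd.DataFrame(dosage, columns=range(1, dosage.shape[1] + 1)))
out.to_csv('output.txt', sep=' ', index=False, float_format='%g')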
I'm working on a Python program to compute a numerical coding of mutated residues and positions for a set of strings (protein sequences), stored in a FASTA-format file with the protein sequences separated by commas. I'm trying to find the positions and sequences that are mutated.
My fasta file is as follows:
MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN
Example:
The following figure (based on another FASTA file) explains the algorithm. In this figure the first box represents the alignment of the input sequences, and the last box represents the output file. How can I do this with my FASTA file in Python?
example input file:
MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD
PROTEIN SEQUENCE ALIGNMENT               DISCARD NON-VARIABLE REGION
positions          1 2 3 4 5 6           positions          2 3 5
protein sequence1  M T A Q D D           protein sequence1  T A D
protein sequence2  M T A Q D D           protein sequence2  T A D
protein sequence3  M T S Q E D           protein sequence3  T S E
protein sequence4  M T A Q D D           protein sequence4  T A D
protein sequence5  M K A Q H D           protein sequence5  K A H

MUTATED RESIDUE IS SPLIT INTO A SEPARATE COLUMN
positions          2 2 3 3 5 5 5
Output file should be like this:
position+residue 2T 2K 3A 3S 5D 5E 5H
sequence1 1 0 1 0 1 0 0
sequence2 1 0 1 0 1 0 0
sequence3 1 0 0 1 0 1 0
sequence4 1 0 1 0 1 0 0
sequence5 0 1 1 0 0 0 1
(RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)
Here are two ways I have tried to do it:
ls = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
alle = sorted(set().union(*pos))
print('\t'.join(str(x) + y for x, y in alle))
for p in pos:
    print('\t'.join('1' if key in p else '0' for key in alle))
(here I'm getting columns for both mutated and non-mutated residues, but I want only the columns for mutated residues)
from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val: (df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print(df.select(lambda x: not df[x].all(), axis=1))
(here it does give output, but the columns are not in order, i.e. first 2K, then 2T, then 3A, and so on.)
How should I be doing this?
The function get_dummies gets you most of the way:
In [11]: s
Out[11]:
0 T
1 T
2 T
3 T
4 K
Name: 1
In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]:
1K 1T
0 0 1
1 0 1
2 0 1
3 0 1
4 1 0
And those columns which have differing values:
In [21]: (df.ix[0] != df).any()
Out[21]:
0 False
1 True
2 True
3 False
4 True
5 False
Putting these together:
In [31]: I = df.columns[(df.ix[0] != df).any()]
In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]
In [33]: df[[]].join(J)
Out[33]:
1K 1T 2A 2S 4D 4E 4H
0 0 1 1 0 1 0 0
1 0 1 1 0 1 0 0
2 0 1 0 1 0 1 0
3 0 1 1 0 1 0 0
4 1 0 1 0 0 0 1
Note: I created the initial DataFrame as follows; however, this may be done more efficiently depending on your situation:
df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))
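For reference, a compact restatement with current pandas (my own sketch, since .ix has been removed in recent versions); it also uses the 1-based position labels the question asks for:

import pandas as pd

df = pd.DataFrame([list(s) for s in 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')])
variable = df.columns[df.nunique() > 1]                      # keep only mutated positions
renamed = df[variable].rename(columns=lambda c: str(c + 1))  # 0-based -> 1-based labels
print(pd.get_dummies(renamed, prefix_sep='').astype(int))    # columns like 2K, 2T, 3A, ...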