Python: Change row and colum in a matrix - python

matrix = []
for index, value in enumerate(['A','C','G','T']):
matrix.append([])
matrix[index].append(value + ':')
for i in range(len(lines[0])):
total = 0
for sequence in lines:
if sequence[i] == value:
total += 1
matrix[index].append(total)
unity = ''
for i in range(len(lines[0])):
column = []
for row in matrix:
column.append(row[1:][i])
maximum = column.index(max(column))
unity += ['A', 'C', 'G', 'T'][maximum]
print("Unity: " + unity)
for row in matrix:
print(' '.join(map(str, row)))
OUTPUT:
Unity: GGCTACGC
A: 1 2 0 2 3 2 0 0
C: 0 1 4 2 1 3 2 4
G: 3 3 2 0 1 2 4 1
T: 3 1 1 3 2 0 1 2
With this code I get this matrix but I want to form the matrix like this:
A C G T
G: 1 0 3 3
G: 2 1 3 1
C: 0 4 2 1
T: 2 2 0 3
A: 3 1 1 2
C: 2 3 2 0
G: 0 2 4 1
C: 0 4 1 2
But I don't know how. I hope someone can help me. Thanks already for the answers.
The sequences are:
AGCTACGT
TAGCTAGC
TAGCTACG
GCTAGCGC
TGCTAGCC
GGCTACGT
GTCACGTC

You're needing to do a transpose of your matrix. I've added comments in the code below to explain what has been changed to make the table.
matrix = []
for index, value in enumerate(['A','C','G','T']):
matrix.append([])
# Don't put colons in column headers
matrix[index].append(value)
for i in range(len(lines[0])):
total = 0
for sequence in lines:
if sequence[i] == value:
total += 1
matrix[index].append(total)
unity = ''
for i in range(len(lines[0])):
column = []
for row in matrix:
column.append(row[1:][i])
maximum = column.index(max(column))
unity += ['A', 'C', 'G', 'T'][maximum]
# Tranpose matrix
matrix = list(map(list, zip(*matrix)))
# Print header with tabs to make it look pretty
print( '\t'+'\t'.join(matrix[0]))
# Print rows in matrix
for row,unit in zip(matrix[1:],unity):
print(unit + ':\t'+'\t'.join(map(str, row)))
The following will be printed:
A C G T
G: 1 0 3 3
G: 2 1 3 1
C: 0 4 2 1
T: 2 2 0 3
A: 3 1 1 2
C: 2 3 2 0
G: 0 2 4 1
C: 0 4 1 2

I think that the best way is to convert your matrix to pandas dataframe and to then use transpose function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html

Related

Can multiple table be formed from a condition in dataframe in python?

I have a dataset:
list1 list2
0 [1,3,4] [4,3,2]
1 [1,3,2] [0,4,6]
2 [4,5,8] NA
3 [6,3,7] [8,2,3]
Is there a process where i can find the count of the common term for- each of the index,
Expected output:
intersection_0, it will compare 0 of list1 with each of list2 and give output, intersection_1 which will compare 1 of list1 with each of list2
Expected_output:
Intersection_0 intersection_1 intersection_2 intersection_3
1 2 1 1
1 0 1 1
0 0 0 0
1 2 0 1
For intersection i was trying:
df['intersection'] = [len(set(a).intersection(b)) for a, b in zip(df1.list1, df1.list2)]
Is there a better way or faster way to achieve this? Thank you in advance
The double loop would go like this:
intersections = []
for l2 in df['list2']:
intersection = []
for l1 in df['list1']:
try:
i = len(np.intersect1d(l1,l2))
except:
i = 0
intersection.append(i)
intersections.append(intersection)
out = (pd.DataFrame(intersections))
Output:
0 1 2 3
0 2 2 1 1
1 1 0 1 1
2 0 0 0 0
3 1 2 1 1

Conditional sum of non zero values

I have a daraframe as below:
Datetime Data Fn
0 18747.385417 11275.0 0
1 18747.388889 8872.0 1
2 18747.392361 7050.0 0
3 18747.395833 8240.0 1
4 18747.399306 5158.0 1
5 18747.402778 3926.0 0
6 18747.406250 4043.0 0
7 18747.409722 2752.0 1
8 18747.420139 3502.0 1
9 18747.423611 4026.0 1
I want to calculate the sum of continious non zero values of Column (Fn)
I want my result dataframe as below:
Datetime Data Fn Sum
0 18747.385417 11275.0 0 0
1 18747.388889 8872.0 1 1
2 18747.392361 7050.0 0 0
3 18747.395833 8240.0 1 1
4 18747.399306 5158.0 1 2 <<<
5 18747.402778 3926.0 0 0
6 18747.406250 4043.0 0 0
7 18747.409722 2752.0 1 1
8 18747.420139 3502.0 1 2
9 18747.423611 4026.0 1 3
You can use groupby() and cumsum():
groups = df.Fn.eq(0).cumsum()
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
Details
First use df.Fn.eq(0).cumsum() to create pseudo-groups of consecutive non-zeros. Each zero will get a new id while consecutive non-zeros will keep the same id:
groups = df.Fn.eq(0).cumsum()
# groups Fn (Fn added just for comparison)
# 0 1 0
# 1 1 1
# 2 2 0
# 3 2 1
# 4 2 1
# 5 3 0
# 6 4 0
# 7 4 1
# 8 4 1
# 9 4 1
Then group df.Fn.ne(0) on these pseudo-groups and cumsum() to generate the within-group sequences:
df['Sum'] = df.Fn.ne(0).groupby(groups).cumsum()
# Datetime Data Fn Sum
# 0 18747.385417 11275.0 0 0
# 1 18747.388889 8872.0 1 1
# 2 18747.392361 7050.0 0 0
# 3 18747.395833 8240.0 1 1
# 4 18747.399306 5158.0 1 2
# 5 18747.402778 3926.0 0 0
# 6 18747.406250 4043.0 0 0
# 7 18747.409722 2752.0 1 1
# 8 18747.420139 3502.0 1 2
# 9 18747.423611 4026.0 1 3
How about using cumsum and reset when value is 0
df['Fn2'] = df['Fn'].replace({0: False, 1: True})
df['Fn2'] = df['Fn2'].cumsum() - df['Fn2'].cumsum().where(df['Fn2'] == False).ffill().astype(int)
df
You can store the fn column in a list and then create a new list and iterate over the stored fn column and check the previous index value if it is greater than zero then add it to current index else do not update it and after this u can make a dataframe for the list and concat column wise to existing dataframe
fn=df[Fn]
sum_list[0]=fn first value
for i in range(1,lenghtofthe column):
if fn[i-1]>0:
sum_list.append(fn[i-1]+fn[i])
else:
sum_list.append(fn[i])
dfsum=pd.Dataframe(sum_list)
df=pd.concat([df,dfsum],axis=1)
Hope this will help you.there may me syntax errors that you can refer google.But the idea is this
try this:
sum_arr = [0]
for val in df['Fn']:
if val > 0:
sum_arr.append(sum_arr[-1] + 1)
else:
sum_arr.append(0)
df['sum'] = sum_arr[1:]
df

Get the sum of rows that contain 0 as a value

I want to know how can I make the source code of the following problem based on Python.
I have a dataframe that contain this column:
Column X
1
0
0
0
1
1
0
0
1
I want to create a list b counting the sum of successive 0 value for getting something like that :
List X
1
3
3
3
1
1
2
2
1
If I understand your question correctly, you want to replace all the zeros with the number of consecutive zeros in the current streak, but leave non-zero numbers untouched. So
1 0 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0 0
becomes
1 4 4 4 4 1 1 1 1 2 2 1 1 1 5 5 5 5 5
To do that, this should work, assuming your input column (a pandas Series) is called x.
result = []
i = 0
while i < len(x):
if x[i] != 0:
result.append(x[i])
i += 1
else:
# See how many times zero occurs in a row
j = i
n_zeros = 0
while j < len(x) and x[j] == 0:
n_zeros += 1
j += 1
result.extend([n_zeros] * n_zeros)
i += n_zeros
result
Adding screenshot below to make usage clearer

Python: how to find values in a dataframe without loop?

I have two dataframes net and M.
net =
i j d
0 5 3 3
1 2 0 2
2 3 2 1
3 4 5 2
4 0 1 3
5 0 3 4
M =
0 1 2 3 4 5
0 0 3 2 4 1 5
1 3 0 2 0 3 3
2 2 2 0 1 1 4
3 4 0 1 0 3 3
4 1 3 1 3 0 2
5 5 3 4 3 2 0
I want to find in M the same values of net['d'], choose randomly a cell in M and create a new dataframe containing the coordinate of that cell. For instance
net['d'][0] = 3
so in M I find:
M[0][1]
M[1][0]
M[1][4]
M[1][5]
...
Finally net1 would be something like that
net1 =
i1 j1 d1
0 1 5 3
1 5 4 2
2 2 3 1
3 1 2 2
4 1 5 3
5 3 0 4
This what I am doing:
I1 = []
J1 = []
for i in net.index:
tmp = net['d'][i]
ds = np.where( M == tmp)
size = len(ds[0])
ind = randint(size) ## find two random locations with distance ds
h = ds[0][ind]
w = ds[1][ind]
I1.append(h)
J1.append(w)
net1 = pd.DataFrame()
net1['i1'] = I1
net1['j1'] = J1
net1['d1'] = net['d']
I am wondering which is the best way to avoid that loop
You can stack the columns of M and then just sample it with replacement
net = pd.DataFrame({'i':[5,2,3,4,0,0],
'j':[3,0,2,5,1,3],
'd':[3,2,1,2,3,4]})
M = pd.DataFrame({0:[0,3,2,4,1,5],
1:[3,0,2,0,3,3],
2:[2,2,0,1,1,4],
3:[4,0,1,0,3,3],
4:[1,3,1,3,0,2],
5:[5,3,4,3,2,0]})
def random_net(net, M):
# make long table and randomize order of rows and rename columns
net1 = M.stack().reset_index()
net1.columns =['i1', 'j1', 'd1']
# get size of each group for random mapping
net1_id_length = net1.groupby('d1').size()
# add id column to uniquely identify row in net
net_copy = net.copy()
# first map gets size of each group and second gets random integer
net_copy['id'] = net_copy['d'].map(net1_id_length).map(np.random.randint)
net1['id'] = net1.groupby('d1').cumcount()
# make for easy lookup
net_copy = net_copy.set_index(['d', 'id'])
net1 = net1.set_index(['d1', 'id'])
# choose from net1 only those from original net
return net1.reindex(net_copy.index).reset_index('d').reset_index(drop=True).rename(columns={'d':'d1'})
random_net(net, M)
output
d1 i1 j1
0 3 5 1
1 2 0 2
2 1 3 2
3 2 1 2
4 3 3 5
5 4 0 3
Timings on 6 million rows
n = 1000000
net = pd.DataFrame({'i':[5,2,3,4,0,0] * n,
'j':[3,0,2,5,1,3] * n,
'd':[3,2,1,2,3,4] * n})
M = pd.DataFrame({0:[0,3,2,4,1,5],
1:[3,0,2,0,3,3],
2:[2,2,0,1,1,4],
3:[4,0,1,0,3,3],
4:[1,3,1,3,0,2],
5:[5,3,4,3,2,0]})
%timeit random_net(net, M)
1 loop, best of 3: 13.7 s per loop

Identifying consecutive occurrences of a value in a column of a pandas DataFrame

I have a df like so:
Count
1
0
1
1
0
0
1
1
1
0
and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count and a 0 if there is not. So in the new column each row would get a 1 based on this criteria being met in the column Count. My desired output would then be:
Count New_Value
1 0
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
0 0
I am thinking I may need to use itertools but I have been reading about it and haven't come across what I need yet. I would like to be able to use this method to count any number of consecutive occurrences, not just 2 as well. For example, sometimes I need to count 10 consecutive occurrences, I just use 2 in the example here.
You could:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
to get:
Count consecutive
0 1 1
1 0 0
2 1 2
3 1 2
4 0 0
5 0 0
6 1 3
7 1 3
8 1 3
9 0 0
From here you can, for any threshold:
threshold = 2
df['consecutive'] = (df.consecutive > threshold).astype(int)
to get:
Count consecutive
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:
df = pd.concat([df for _ in range(1000)])
%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared to:
%%timeit
l = []
for k, g in groupby(df.Count):
size = sum(1 for _ in g)
if k == 1 and size >= 2:
l = l + [1]*size
else:
l = l + [0]*size
pd.Series(l)
10 loops, best of 3: 76.7 ms per loop
Not sure if this is optimized, but you can give it a try:
from itertools import groupby
import pandas as pd
l = []
for k, g in groupby(df.Count):
size = sum(1 for _ in g)
if k == 1 and size >= 2:
l = l + [1]*size
else:
l = l + [0]*size
df['new_Value'] = pd.Series(l)
df
Count new_Value
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0

Categories

Resources