I have a Pandas data frame with some categorical variables. Something like this -
>>df
'a', 'x'
'a', 'y'
Now, I want to return a matrix with the conditional probabilities of each level appearing with every other level. For the data frame above, it would look like -
[1, 0.5, 0.5],
[1, 1, 0],
[1, 0, 1]
The three entries correspond to the levels 'a', 'x' and 'y'.
This is because conditional on the first column being 'a', the probabilities of 'x' and 'y' appearing are 0.5 each and so on.
I have some code that does this (below). However, the problem is that it is excruciatingly slow. So slow that the application I want to use it in times out. Does anyone have any tips to make it faster?
import numpy as np
import pandas as pd

df = pd.read_csv('pathToData.csv')
df = df.fillna("null")
cols = 0
col_levels = []
columns = {}
num = 0
for i in df.columns:
    cols += len(set(df[i]))
    col_levels.append(np.sort(list(set(df[i]))))
    for j in np.sort(list(set(df[i]))):
        columns[i + '_' + str(j)] = num
        num += 1
res = np.eye(cols)
for i in range(len(df.columns)):
    for j in range(len(df.columns)):
        if i != j:
            row_feature = df.columns[i]
            col_feature = df.columns[j]
            rowLevels = col_levels[i]
            colLevels = col_levels[j]
            for ii in rowLevels:
                for jj in colLevels:
                    frst = (df[row_feature] == ii) * 1
                    scnd = (df[col_feature] == jj) * 1
                    prob = sum(frst*scnd)/(sum(frst) + 1e-9)
                    frst_ind = columns[row_feature + '_' + ii]
                    scnd_ind = columns[col_feature + '_' + jj]
                    res[frst_ind, scnd_ind] = prob
EDIT: Here is a bigger example:
>>df
'a', 'x', 'l'
'a', 'y', 'l'
'b', 'x', 'l'
The distinct categories here are 'a', 'b', 'x', 'y' and 'l'. Since there are 5 categories, the output matrix should be 5x5. The first row, first column entry is how often 'a' appears conditional on 'a'. This is of course 1 (as are all the diagonal entries). The first row, second column entry is the probability of 'b' conditional on 'a'. Since 'a' and 'b' are part of the same column, this is zero. The first row, third column entry is the probability of 'x' conditional on 'a'. We see that 'a' appears twice but only once with 'x', so this probability is 0.5. And so on.
The way I approach the problem is to first calculate all unique levels in the dataset, then loop through the cartesian product of those levels. At each step, filter the dataset to create a subset where the condition holds, then count the number of rows in that subset where the event has happened. Below is my code.
import pandas as pd
from itertools import product
from collections import defaultdict
df = pd.DataFrame({
'col1': ['a', 'a', 'b'],
'col2': ['x', 'y', 'x'],
'col3': ['l', 'l', 'l']
})
levels = df.stack().unique()
res = defaultdict(dict)
for event, cond in product(levels, levels):
    # create a subset of rows with at least one element equal to cond
    conditional_set = df[(df == cond).any(axis=1)]
    conditional_set_size = len(conditional_set)
    # count the number of rows in the subset where at least one element is equal to event
    conditional_event_count = (conditional_set == event).any(axis=1).sum()
    res[event][cond] = conditional_event_count / conditional_set_size
result_df = pd.DataFrame(res)
print(result_df)
# OUTPUT
# a b l x y
# a 1.000000 0.000000 1.0 0.500000 0.500000
# b 0.000000 1.000000 1.0 1.000000 0.000000
# l 0.666667 0.333333 1.0 0.666667 0.333333
# x 0.500000 0.500000 1.0 1.000000 0.000000
# y 1.000000 0.000000 1.0 0.000000 1.000000
I am sure there are other faster methods, but it is the first thing that comes to my mind.
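A fully vectorised route should be faster still: one-hot encode every level and get all co-occurrence counts from a single matrix product, then divide each row by the count of its conditioning level. A minimal sketch of that idea (assuming, as in the examples, that levels are unique across columns):
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'y', 'x'],
    'col3': ['l', 'l', 'l']
})

# one indicator column per level; the empty prefix keeps the bare level names
dummies = pd.get_dummies(df, prefix='', prefix_sep='').astype(int)

# counts.loc[i, j] = number of rows in which both level i and level j appear
counts = dummies.T.dot(dummies)

# divide each row by the count of its conditioning level:
# probs.loc[cond, event] = P(event | cond)
diag = pd.Series(counts.values.diagonal(), index=counts.index)
probs = counts.div(diag, axis=0)
print(probs)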
I am working with an extremely large dataframe. Here is a sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'ID': ['A', 'A', 'A', 'X', 'X', 'Y'],
})
ID
0 A
1 A
2 A
3 X
4 X
5 Y
Now, given the frequency of each value in column 'ID', I want to calculate a weight using the function below and add a column that has the weight associated with each value in 'ID'.
def get_weights_inverse_num_of_samples(label_counts, power=1.):
    no_of_classes = len(label_counts)
    weights_for_samples = 1.0/np.power(np.array(label_counts), power)
    weights_for_samples = weights_for_samples/np.sum(weights_for_samples)*no_of_classes
    return weights_for_samples
freq = df.value_counts()
print(freq)
ID
A 3
X 2
Y 1
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
[0.54545455 0.81818182 1.63636364]
So, I am looking for an efficient way to get a dataframe like this given the above weights:
ID sample_weight
0 A 0.54545455
1 A 0.54545455
2 A 0.54545455
3 X 0.81818182
4 X 0.81818182
5 Y 1.63636364
If you rely on duck-typing a little bit more, you can rewrite your function so that it returns the same type it was given as input.
This saves you from needing to explicitly reach back into the .index before calling .map.
import pandas as pd
df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})
def get_weights_inverse_num_of_samples(label_counts, power=1):
    """Using object methods here instead of coercing to numpy ndarray"""
    no_of_classes = len(label_counts)
    weights_for_samples = 1 / (label_counts ** power)
    return weights_for_samples / weights_for_samples.sum() * no_of_classes
# select the column before using `.value_counts()`
# this saves us from ending up with a `MultiIndex` Series
freq = df['ID'].value_counts()
weights = get_weights_inverse_num_of_samples(freq)
print(weights)
# A 0.545455
# X 0.818182
# Y 1.636364
# note that now our weights are still a `pd.Series`
# that we can align directly against our `"ID"` column
df['sample_weight'] = df['ID'].map(weights)
print(df)
# ID sample_weight
# 0 A 0.545455
# 1 A 0.545455
# 2 A 0.545455
# 3 X 0.818182
# 4 X 0.818182
# 5 Y 1.636364
You can map the values:
df['sample_weight'] = df['ID'].map(dict(zip(freq.index.get_level_values(0), weights)))
NB: value_counts called on the whole frame returns a Series keyed by a MultiIndex (with a single level here), hence the get_level_values call.
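To make the difference concrete, a quick check (with a reasonably recent pandas) of the two index types:
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'A', 'A', 'X', 'X', 'Y']})

# frame-level value_counts keys the result by a (single-level) MultiIndex of tuples
print(df.value_counts().index)
# MultiIndex([('A',), ('X',), ('Y',)], names=['ID'])

# column-level value_counts keys the result by the plain values
print(df['ID'].value_counts().index)
# Index(['A', 'X', 'Y'], dtype='object', name='ID')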
As noted by #ScottBoston, a better approach would be to use:
freq = df['ID'].value_counts()
df['sample_weight'] = df['ID'].map(dict(zip(freq.index, weights)))
Output:
ID sample_weight
0 A 0.545455
1 A 0.545455
2 A 0.545455
3 X 0.818182
4 X 0.818182
5 Y 1.636364
Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
One way is to create a dictionary with your mappings, map it onto your column names, then stack, apply your groupby operation, and unstack back to your original shape.
I couldn't see any logic in your mappings, so it will have to be a manual mapping, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map( {i : x for x,y in buckets.items() for i in y})
out = df.stack().groupby(level=[0,1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
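If you want single matches as plain scalars rather than one-element lists (to match the requested output exactly), a small post-processing step on out along these lines should do it:
# unwrap one-element lists; NaN and longer lists pass through unchanged
# (in pandas >= 2.1 use out.map instead of out.applymap)
out = out.applymap(lambda v: v[0] if isinstance(v, list) and len(v) == 1 else v)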
First create the dict for the mapping, then groupby:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d, axis=1).agg(lambda x: [y[y != '-'] for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
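Note that groupby(..., axis=1) is deprecated in recent pandas releases; a roughly equivalent sketch with the same d mapping transposes first, groups the (former) columns, and transposes back:
# collect the non-'-' entries of each bucket, then restore the original orientation
out = df.T.groupby(d).agg(lambda x: [v for v in x if v != '-']).T
print(out)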
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
import numpy as np
import pandas as pd

buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}

def clean(val):
    # drop the NaN entries from the row's values
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val

new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = pd.Series(df[old_cols].values.tolist(), index=df.index).apply(clean)
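A quick way to try this out on the question's example (just a sketch of the setup; NaN stands in for the '-' placeholders):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, np.nan], 'b': [np.nan, 6, 3], 'c': [np.nan, np.nan, 2],
    'd': [np.nan, 0, np.nan], 'e': [2, 4, np.nan], 'f': [np.nan, np.nan, np.nan],
    'g': [4, np.nan, np.nan], 'h': [3, np.nan, 1], 'i': [np.nan, np.nan, np.nan],
    'j': [np.nan, 2, 9],
})
# running the loop above on this frame gives new_df columns x, y, z, j with
# scalars for single matches and lists where several source columns were populated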
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row on a dataframe (here, a sub-dataframe, if you will, since we'll get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd
bucket_cols=[["a", "b", "c"], ["d", "f", "g"], ["e", "h","i"], ["j"]]
bucket_names=["x", "y", "z", "j"]
buckets = {}
def bucket_rows(row):
    row = row.dropna().to_list()  # pd.Series.dropna removes the NaN values
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row

# loop through the buckets and perform the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[bucket_names[idx]] = bucket

# create the bucketed df from the buckets dict
df_bucketted = pd.DataFrame(buckets)
I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer for the participant should be.
I want a column in my dataframe saying yes or no depending on whether the letter 2 positions earlier was the same letter.
Here is my code:
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
letters
0 F
1 I
2 D
3 I
4 H
5 C
6 L
7 G
8 D
9 L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error :
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
And if I run it with numbers, I get no errors, but my new response column only contains 'no'. But I know that 12 of them should be 'yes'.
It seems like you want to compare the column with the same column shifted by two elements. Use shift with np.where (numpy imported as np):
df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
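If you would rather stay entirely within pandas (no numpy import), the same shifted comparison can be mapped to labels directly; an equivalent one-liner:
df['response'] = df.letters.eq(df.letters.shift(2)).map({True: 'yes', False: 'no'})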
When I have a data frame in pandas like:
raw_data = {
'subject_id': ['1', '2', '3', '4', '5'],
'name': ['A', 'B', 'C', 'D', 'E'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US'],
'alotdifferent': ['x', 'y', 'z', 'x', 'a'],
'target': [0,0,0,1,1],
'age_group' : [1, 2, 1, 3, 1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
df_a.nationality = df_a.nationality.astype('category')
df_a.alotdifferent = df_a.alotdifferent.astype('category')
df_a.name = df_a.name.astype('category')
Currently, I use:
FACTOR_FIELDS = df_a.select_dtypes(include=['category']).columns
columnsToDrop = ['alotdifferent']
columnsToBias_keep = FACTOR_FIELDS[~FACTOR_FIELDS.isin(columnsToDrop)]
target = 'target'
def quotients_slow(df_a):
    # parallelism = 8
    # original = dd.from_pandas(df.copy())
    original = df_a.copy()
    output_df = original
    ratio_weights = {}
    for colname in columnsToBias_keep.union(columnsToDrop):
        # group only a single time
        grouped = original.groupby([colname, target]).size()
        # calculate first ratio
        df = grouped / original[target].sum()
        nameCol = "pre_" + colname
        grouped_res = df.reset_index(name=nameCol)
        grouped_res = grouped_res[grouped_res[target] == 1]
        grouped_res = grouped_res.drop(target, 1)
        # todo persist the result in dict for transformer
        result_1 = grouped_res
        # calculate second ratio
        df = (grouped / grouped.groupby(level=0).sum())
        nameCol_2 = "pre2_" + colname
        grouped = df.reset_index(name=nameCol_2)
        grouped_res = grouped[grouped[target] == 1]
        grouped_res = grouped_res.drop(target, 1)
        result_2 = grouped_res
        # persist the result in dict for transformer
        # this is required to separate fit and transform stage (later on in a sklearn transformer)
        ratio_weights[nameCol] = result_1
        ratio_weights[nameCol_2] = result_2
        # retrieve results
        res_1 = ratio_weights['pre_' + colname]
        res_2 = ratio_weights['pre2_' + colname]
        # merge ratio_weight with original dataframe
        output_df = pd.merge(output_df, res_1, on=colname, how='left')
        output_df = pd.merge(output_df, res_2, on=colname, how='left')
        output_df.loc[(output_df[nameCol].isnull()), nameCol] = 0
        output_df.loc[(output_df[nameCol_2].isnull()), nameCol_2] = 0
        if colname in columnsToDrop:
            output_df = output_df.drop(colname, 1)
    return output_df
quotients_slow(df_a)
to calculate the ratio of each group to target == 1 for each (categorical) column, in two ways. As I want to perform this operation for multiple columns, I naively iterate over all of them, but this operation is very slow.
Here in the sample: 10 loops, best of 3: 37 ms per loop. For my real dataset of around 500000 rows and around 100 columns this really takes a while.
Shouldn't it be possible to speed it up (column parallel manner, trivial parallelization) in either dask or pandas? Is there a possibility to implement it more efficiently in plain pandas? Is it possible to reduce the number of passes over the data for computing the quotients?
EDIT
When trying to use dask.delayed in the for loop to achieve parallelism over the columns, I can't figure out how to build the graph over the columns, because I need to call compute to get the tuples.
delayed_res_name = delayed(compute_weights)(df_a, 'name')
a,b,c,d = delayed_res_name.compute()
ratio_weights = {}
ratio_weights[c] = a
ratio_weights[d] = b
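One way to sketch the graph over all columns is to collect the delayed objects in a list and materialise them with a single dask.compute call, rather than calling .compute() per column; this assumes compute_weights keeps the same four-tuple return as above:
from dask import compute, delayed

# build one delayed task per column, without computing inside the loop
tasks = [delayed(compute_weights)(df_a, col)
         for col in columnsToBias_keep.union(columnsToDrop)]

# a single compute() materialises the whole graph, in parallel where possible
ratio_weights = {}
for a, b, c, d in compute(*tasks):
    ratio_weights[c] = a
    ratio_weights[d] = b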
Here's a reasonably fast solution for your first quotient, using Pandas. It assumes you are not interested in computing proportions for subject_id. I also added some data to your example to cover more edge cases.
First, generate sample data:
raw_data = {
'subject_id': ['1', '2', '3', '4', '5', '6','7'],
'name': ['A', 'B', 'C', 'D', 'E', 'A','A'],
'nationality': ['DE', 'AUT', 'US', 'US', 'US', 'DE','DE'],
'alotdifferent': ['x', 'y', 'z', 'x', 'a','x','z'],
'target': [0,0,0,1,1,0,1],
'age_group' : [1, 2, 1, 3, 1, 2,1]}
df_a = pd.DataFrame(raw_data, columns = ['subject_id', 'name', 'nationality', 'alotdifferent','target','age_group'])
Now compute proportions and measure speed:
def compute_prop(group):
    return group.sum() / float(group.count())

def build_master(df):
    master = df.copy()
    fields = df.drop(['subject_id', 'target'], 1).columns
    for field in fields:
        master = (pd.merge(master, df.groupby(field, as_index=False)
                                      .agg({'target': compute_prop})
                                      .rename(columns={'target': 'pre_{}'.format(field)}),
                           on=field)
                  )
    master = master.sort_values('subject_id')
    return master
%timeit master = build_master(df_a)
10 loops, best of 3: 17.1 ms per loop
Output:
subject_id name nationality alotdifferent target age_group pre_name \
0 1 A DE x 0 1 0.333333
5 2 B AUT y 0 2 0.000000
2 3 C US z 0 1 0.000000
6 4 D US x 1 3 1.000000
3 5 E US a 1 1 1.000000
4 6 A DE x 0 2 0.333333
1 7 A DE z 1 1 0.333333
pre_nationality pre_alotdifferent pre_age_group
0 0.333333 0.333333 0.5
5 0.000000 0.000000 0.0
2 0.666667 0.500000 0.5
6 0.666667 0.333333 1.0
3 0.666667 1.000000 0.5
4 0.333333 0.333333 0.0
1 0.333333 0.500000 0.5
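The question's other ratio (the count of a level among target == 1 rows divided by the total number of target == 1 rows, the pre_ columns in the original code) can be sketched in a similarly merge-free way with value_counts and map; the column list here is just the example's:
n_pos = df_a['target'].sum()
for field in ['name', 'nationality', 'alotdifferent', 'age_group']:
    # share of each level among the target == 1 rows
    share = df_a.loc[df_a['target'] == 1, field].value_counts() / n_pos
    df_a['pre_' + field] = df_a[field].map(share).fillna(0)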
My data looks like:
SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4 PI081762 T A A T
5 PI101404A T A A T
6 PI101404B T A A T
7 PI135624 T A A T
8 PI326581 T A A T
9 PI326582A T A A T
10 PI326582B T A A T
11 PI339732 T A A T
12 PI339735A T A A T
13 PI339735B T A A T
14 PI342618A T A A T
In reality I have a dataset of 50,000 columns of 479 rows. My objective is to go through each column with characters and convert the data to integers depending on which is the most abundant character.
Right now I have the data read in, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite work out how to use a for loop or the apply function across all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000~50,000 columns to analyze.
My code so far is:
import pandas as pd
df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')
df.head() # check that the file format fits
# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)
My function:
def countAlleles(N):
    # N is just supposed to be the column; ideally once I've optimized the function
    # I need to analyze every column
    # Will hold the counts of each letter in the column
    letterCount = [0, 0, 0, 0, 0, 0]
    # This is a parallel list to know the order
    letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
    # Booleans to record which one is the maximum
    TFlag = None
    AFlag = None
    GFlag = None
    CFlag = None
    HFlag = None
    UFlag = None
    # Loop through the column to determine which one is the maximum
    for i in range(len(N)):  # How do I get index information of the column?
        if N[i] == 'T':  # If the element in the column is T
            letterCount[0] = letterCount[0] + 1
        elif N[i] == 'A':
            letterCount[1] = letterCount[1] + 1
        elif N[i] == 'G':
            letterCount[2] = letterCount[2] + 1
        elif N[i] == 'C':
            letterCount[3] = letterCount[3] + 1
        elif N[i] == 'H':
            letterCount[4] = letterCount[4] + 1
        else:
            letterCount[5] = letterCount[5] + 1
    max = letterCount[0]  # This will hold the maximum value
    mIndex = 0  # This holds the index position with the max value
    # Determine which one is max
    for i in range(len(letterCount)):
        if letterCount[i] > max:
            max = letterCount[i]
            mIndex = i
So I designed the function to take a column as input, in hopes of being able to iterate through all the columns of the dataframe. My main question is:
1) How would I pass each column in as a parameter and loop through its elements?
My major source of confusion is how indexes are being used in pandas. I'm familiar with 2-dimensional array in C++ and Java and that is most of where my knowledge stems from.
I'm attempting to use the apply function:
df2 = df2.apply(countAlleles('ss715583617'), axis=2)
but it doesn't seem that my application is correct.
Updated answer: the dataframe is now analyzed and the characters are replaced with int values according to the occurrences of each allele per column. The problem of what happens if one allele has the same number of occurrences as another is still the same: the assignment will not be unique.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])
for col in df:
    alleles = list()
    for num, allele in enumerate(letterOrder):
        alleles.append(df[col].str.count(allele).sum())
    # letters sorted by how often they occur, most frequent first
    repl = letterOrder[np.argsort(alleles)][::-1]
    # directly replace chars by their rank value
    for num, char in enumerate(repl):
        df[col].replace(char, num+1, inplace=True)
print(df)
This will change the initial dataframe
ss1 ss2 ss3
0 T G C
1 T G H
2 T T C
3 G A H
to the new dataframe with ints sorted according to the number of occurrences:
ss1 ss2 ss3
0 1 1 2
1 1 1 1
2 1 3 2
3 2 2 1
For reference the old answer which gives the maximum column indices:
import pandas as pd
import numpy as np
from collections import OrderedDict
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
full_results = OrderedDict()
for col in df:
    alleles = list()
    for num, allele in enumerate(letterOrder):
        alleles.append(df[col].str.count(allele).sum())
    full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]
print(full_results)
This will give:
OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])
The key in the dict is the name of your column, and the value is a list with [allele, number_of_occurrences].
I used OrderedDict to keep the order of your columns and the name, but if you don't need the order, you can use a dict, or if you don't need the column name (and the implicit ID is enough), use a list.
But be careful: If in one column two (or more) characters have the same number of counts, this will only return one of them. You would need to add an additional test for this.
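A sketch of such a test, reusing the per-column counts from the loop above, could look like this:
for col in df:
    counts = [df[col].str.count(letter).sum() for letter in letterOrder]
    top_two = sorted(counts, reverse=True)[:2]
    if top_two[0] == top_two[1]:
        print("column {}: top count is tied, the chosen allele is ambiguous".format(col))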
To iterate over columns in e.g. a for loop, use list(df). Anyhow, you can easily do what you are attempting using collections.Counter
assume a dataframe df
df
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 T A A T
#1 PI101404A T A A T
#2 PI101404B T A A T
#3 PI135624 T A A T
#4 PI326581 T A F D
#5 PI326582A G A F T
#6 PI326582B G A A T
#7 PI339732 D H A T
#8 PI339735A D A A T
#9 PI339735B A A A T
#10 PI342618A D A A T
What I gather from the comments section and your original post, you want to replace each character in each column according to its frequency of occurrence. This is one approach:
Make the Counters
from collections import Counter
cols = [ col for col in list(df) if col not in ['Name'] ] # all the column you want to operate on
col_counters = { col: Counter( df[col] ) for col in cols }
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}
Sort the items in each Counter
sort_func = lambda items: sorted(items, key=lambda x: x[1], reverse=True)  # sort the (letter, count) pairs by count, descending
sort_result = { col: sort_func(counter.items()) for col, counter in col_counters.items() }
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}
Replace letters in dataframe according to sort result
Here we will use enumerate to get the position of each sort result
mapper = { col: {letter: i+1 for i, (letter, count) in enumerate(sort_result[col])} for col in sort_result }
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}
df.replace( to_replace=mapper, inplace=True)
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 1 1 1 1
#1 PI101404A 1 1 1 1
#2 PI101404B 1 1 1 1
#3 PI135624 1 1 1 1
#4 PI326581 1 1 2 2
#5 PI326582A 3 1 2 1
#6 PI326582B 3 1 1 1
#7 PI339732 2 2 1 1
#8 PI339735A 2 1 1 1
#9 PI339735B 4 1 1 1
#10 PI342618A 2 1 1 1
This should be enough to get you on your way. I am not sure how you want to handle duplicate elements, for instance if a column has the same number of T and G.
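If you want the tie-break to be deterministic rather than whatever order the Counter happens to yield, one option (a sketch, assuming a fixed preference order) is to sort by count first and by a fixed letter ranking second:
letter_rank = {letter: i for i, letter in enumerate('TAGCHU')}
sort_func = lambda items: sorted(items, key=lambda x: (-x[1], letter_rank.get(x[0], len(letter_rank))))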