My data looks like:
SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4 PI081762 T A A T
5 PI101404A T A A T
6 PI101404B T A A T
7 PI135624 T A A T
8 PI326581 T A A T
9 PI326582A T A A T
10 PI326582B T A A T
11 PI339732 T A A T
12 PI339735A T A A T
13 PI339735B T A A T
14 PI342618A T A A T
In reality my dataset has about 50,000 columns and 479 rows. My objective is to go through each column of characters and convert the data to integers depending on which character is the most abundant.
Right now I have the data read in, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite understand how to use a for loop or the apply function across all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000 to 50,000 columns to analyze.
My code so far is:
import pandas as pd
df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')
df.head() # check that the file format fits
# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)
My function:
def countAlleles(N):
    # N is supposed to be a single column; once the function works,
    # I need to analyze every column
    # Holds the count of each letter in the column
    letterCount = [0, 0, 0, 0, 0, 0]
    # Parallel list giving the letter for each position in letterCount
    letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
    # Booleans to flag which one is the maximum
    TFlag = None
    AFlag = None
    GFlag = None
    CFlag = None
    HFlag = None
    UFlag = None
    # Loop through the column to count each letter
    for i in range(len(N)):  # How do I get index information of the column?
        if N[i] == 'T':  # If the element in the column is T
            letterCount[0] = letterCount[0] + 1
        elif N[i] == 'A':
            letterCount[1] = letterCount[1] + 1
        elif N[i] == 'G':
            letterCount[2] = letterCount[2] + 1
        elif N[i] == 'C':
            letterCount[3] = letterCount[3] + 1
        elif N[i] == 'H':
            letterCount[4] = letterCount[4] + 1
        else:
            letterCount[5] = letterCount[5] + 1
    maxCount = letterCount[0]  # This will hold the value of the maximum
    mIndex = 0  # This holds the index position with the max value
    # Determine which one is max
    for i in range(len(letterCount)):
        if letterCount[i] > maxCount:
            maxCount = letterCount[i]
            mIndex = i
So I designed the function to input the column, in hopes to be able to iterate through all the columns of the dataframe. My main question is:
1) How would I pass each column as a parameter, so that the for loop runs through the elements of each column?
My major source of confusion is how indexes are being used in pandas. I'm familiar with 2-dimensional array in C++ and Java and that is most of where my knowledge stems from.
I'm attempting to use the apply function:
df2 = df2.apply(countAlleles('ss715583617'), axis=2)
but it doesn't seem that my application is correct.
Updated answer: The dataframe is now analyzed and its values replaced with ints according to how often each allele occurs in its column. The problem of two alleles having the same number of occurrences remains: in that case the assignment is not unique.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])
for col in df:
    # count how often each allele occurs in this column
    alleles = [df[col].str.count(allele).sum() for allele in letterOrder]
    # alleles sorted from most to least frequent
    repl = letterOrder[np.argsort(alleles)][::-1]
    # replace each character by its rank and assign the result back to the column
    for num, char in enumerate(repl):
        df[col] = df[col].replace(char, num + 1)
print(df)
This will change the initial dataframe
ss1 ss2 ss3
0 T G C
1 T G H
2 T T C
3 G A H
to the new dataframe, with ints ranked by number of occurrences:
ss1 ss2 ss3
0 1 1 2
1 1 1 1
2 1 3 2
3 2 2 1
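Since the question asked specifically about apply, here is a minimal sketch of the same ranking done column by column with DataFrame.apply. The helper name rank_alleles and the small example frame are made up for illustration, and ties are deliberately avoided in the data because value_counts does not promise a particular order for equal counts.

```python
import pandas as pd

def rank_alleles(col: pd.Series) -> pd.Series:
    # value_counts sorts by descending count, so the position in its index is the rank
    order = col.value_counts().index
    mapping = {allele: rank for rank, allele in enumerate(order, start=1)}
    return col.map(mapping)

df = pd.DataFrame({"ss1": ["T", "T", "T", "G"],
                   "ss2": ["A", "G", "G", "G"]})

# apply passes each column (axis=0 is the default) to rank_alleles as a Series
ranked = df.apply(rank_alleles)
print(ranked)
```

With apply there is no need to name the columns, so the same call should work unchanged on a frame with tens of thousands of columns.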
For reference, the old answer, which gives the most frequent allele and its count for each column:
import pandas as pd
import numpy as np
from collections import OrderedDict
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
full_results = OrderedDict()
for col in df:
    alleles = list()
    for num, allele in enumerate(letterOrder):
        alleles.append(df[col].str.count(allele).sum())
    full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]
print(full_results)
This will give:
OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])
The key in the dict is the name of your column, and the value is a list with [allele, number_of_occurrences].
I used OrderedDict to keep the order of your columns together with their names, but if you don't need the order, you can use a dict, or if you don't need the column name (and the implicit position is enough), use a list.
But be careful: if two (or more) characters in one column have the same count, this will return only one of them. You would need to add an additional test for this.
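One possible form of that additional test, as a rough sketch: collect every allele whose count equals the column maximum, so a tie shows up as a list with more than one entry. The helper name top_alleles is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"ss1": ["T", "T", "T", "G"],
                   "ss3": ["C", "H", "C", "H"]})

def top_alleles(col):
    counts = col.value_counts()
    # keep every allele that reaches the maximum count; sorted for a stable order
    return sorted(counts[counts == counts.max()].index)

ties = {name: top_alleles(df[name]) for name in df}
print(ties)  # a list longer than one entry marks a tied column
```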
To iterate over columns in e.g. a for loop, use list(df). Anyhow, you can easily do what you are attempting using collections.Counter
assume a dataframe df
df
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 T A A T
#1 PI101404A T A A T
#2 PI101404B T A A T
#3 PI135624 T A A T
#4 PI326581 T A F D
#5 PI326582A G A F T
#6 PI326582B G A A T
#7 PI339732 D H A T
#8 PI339735A D A A T
#9 PI339735B A A A T
#10 PI342618A D A A T
From what I gather from the comments section and your original post, you want to replace each character in each column according to its frequency of occurrence. This is one approach:
Make the Counters
from collections import Counter
cols = [ col for col in list(df) if col not in ['Name'] ] # all the column you want to operate on
col_counters = { col: Counter( df[col] ) for col in cols }
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}
Sort the items in each Counter
sort_func = lambda items: sorted(items, key=lambda x:x[1], reverse=True ) # sort a nested list according to second element in each sublist
sort_result = { col: sort_func(counter.items()) for col,counter in col_counters.items() }
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}
Replace letters in dataframe according to sort result
Here we will use enumerate to get the position of each sort result
mapper = { col: {letter:i+1 for i,(letter,count) in enumerate(sort_result[col]) } for col in sort_result }
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}
df.replace( to_replace=mapper, inplace=True)
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 1 1 1 1
#1 PI101404A 1 1 1 1
#2 PI101404B 1 1 1 1
#3 PI135624 1 1 1 1
#4 PI326581 1 1 2 2
#5 PI326582A 3 1 2 1
#6 PI326582B 3 1 1 1
#7 PI339732 2 2 1 1
#8 PI339735A 2 1 1 1
#9 PI339735B 4 1 1 1
#10 PI342618A 2 1 1 1
This should be enough to get you on your way. I am not sure how you want to handle ties, for instance if a column has the same number of T and G.
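If ties need a deterministic answer, one option (an assumption on my part, not something the question specifies) is to break them alphabetically by sorting on the pair (-count, letter). A small sketch with a made-up helper rank_map:

```python
from collections import Counter

def rank_map(values):
    counts = Counter(values)
    # most frequent first; equal counts are ordered alphabetically
    ordered = sorted(counts, key=lambda letter: (-counts[letter], letter))
    return {letter: i + 1 for i, letter in enumerate(ordered)}

mapping = rank_map(["T", "G", "G", "T", "A"])
print(mapping)  # {'G': 1, 'T': 2, 'A': 3}
```

The resulting dict has the same shape as one entry of mapper above, so it could be passed to replace in the same way.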
Suppose I have a dataframe with (for example) 10 columns: a,b,c,d,e,f,g,h,i,j
I want to bucket these columns as follows: a,b,c into x, d,f,g into y, e,h,i into z and j into j.
Each row of the output will have the x column value equal to the non-NaN a or b or c value of the original df. In case of multiple non-NaN values for a,b,c columns for a particular row in the original df, the output df will just contain a list of those non-NaN values.
To give an example, if the original df is (- just means NaN to save typing effort):
a b c d e f g h i j
0 1 - - - 2 - 4 3 - -
1 - 6 - 0 4 - - - - 2
2 - 3 2 - - - - 1 - 9
The output will be:
x y z j
0 1 4 [2,3] -
1 6 0 4 2
2 [3,2] - 1 9
Is there an efficient way of doing this? I'm not even able to get started using conventional methods.
One way is to create a dictionary with your mappings, rename the columns with it, stack, apply your groupby operation, and unstack back to the original shape.
I couldn't see any logic in your mappings, so it will have to be a manual operation, I'm afraid.
buckets = {'x': ['a', 'b', 'c'], 'y': ['d', 'f', 'g'], 'z': ['e', 'h', 'i'], 'j': 'j'}
df.columns = df.columns.map( {i : x for x,y in buckets.items() for i in y})
out = df.stack().groupby(level=[0,1]).agg(list).unstack(1)[buckets.keys()]
print(out)
x y z j
0 [1] [4] [2, 3] NaN
1 [6] [0] [4] [2]
2 [3, 2] NaN [1] [9]
First create the dict for the mapping, then group by it:
d = {'a':'x','b':'x','c':'x','d':'y','f':'y','g':'y','e':'z','h':'z','i':'z','j':'j'}
out = df.groupby(d,axis=1).agg(lambda x : [y[y!='-']for y in x.values])
Out[138]:
j x y z
0 [] [1] [4] [2, 3]
1 [2] [6] [0] [4]
2 [9] [3, 2] [] [1]
Starting with a very basic approach, let's define our buckets and simply iterate, then clean up:
buckets = {
    'x': ['a', 'b', 'c'],
    'y': ['d', 'f', 'g'],
    'z': ['e', 'h', 'i'],
    'j': ['j']
}
def clean(val):
    # drop NaN entries, then collapse empty and single-item lists
    val = [x for x in val if not np.isnan(x)]
    if len(val) == 0:
        return np.nan
    elif len(val) == 1:
        return val[0]
    else:
        return val
new_df = pd.DataFrame()
for new_col, old_cols in buckets.items():
    new_df[new_col] = [clean(row) for row in df[old_cols].values.tolist()]
Here's how you can do it.
First, we define a method to perform the row-wise bucketing operation.
def bucket_rows(row):
    row = row.dropna().to_list()
    if len(row) == 0:
        row = [np.nan]
    return row
Then, we can use the pandas.DataFrame.apply method to map this function onto each row on a dataframe (here, a sub-dataframe, if you will, since we'll get the sub-df using the column names).
I have implemented everything in the following code snippet.
import numpy as np
import pandas as pd
bucket_cols=[["a", "b", "c"], ["d", "f", "g"], ["e", "h","i"], ["j"]]
bucket_names=["x", "y", "z", "j"]
buckets = {}
def bucket_rows(row):
    row = row.dropna().to_list()  # drop NaN values with pd.Series.dropna
    # if the list is empty, populate it with NaN
    if len(row) == 0:
        row = [np.nan]
    # return the bucketed row
    return row
# loop through the buckets and perform the bucketing operation
for idx, cols in enumerate(bucket_cols):
    bucket = df[cols].apply(bucket_rows, axis=1).to_list()
    buckets[bucket_names[idx]] = bucket
# create the bucketed df from the buckets dict
df_bucketed = pd.DataFrame(buckets)
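To get closer to the output requested in the question (a scalar when a bucket holds one value, a list when it holds several, NaN when it holds none), here is a self-contained sketch on the question's sample data. The helper name collapse is made up, and the input frame is my reconstruction of the example with "-" read as NaN.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "a": [1, nan, nan], "b": [nan, 6, 3], "c": [nan, nan, 2],
    "d": [nan, 0, nan], "e": [2, 4, nan], "f": [nan, nan, nan],
    "g": [4, nan, nan], "h": [3, nan, 1], "i": [nan, nan, nan],
    "j": [nan, 2, 9],
})
buckets = {"x": ["a", "b", "c"], "y": ["d", "f", "g"],
           "z": ["e", "h", "i"], "j": ["j"]}

def collapse(values):
    # keep the non-NaN values of one row within one bucket
    kept = [v for v in values if not pd.isna(v)]
    if not kept:
        return np.nan
    return kept[0] if len(kept) == 1 else kept

out = pd.DataFrame({
    name: pd.Series([collapse(row) for row in df[cols].to_numpy().tolist()],
                    index=df.index, dtype=object)
    for name, cols in buckets.items()
})
print(out)
```

Mixing scalars and lists forces object dtype, so this is convenient for display but slow for further numeric work.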
I have a dataframe-
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
a b c
0 1 0 1
1 2 3 1
2 4 5 1
and a list [('a', 0.91), ('b', 5), ('c', 2)].
Now I want to go through each row, multiply each df element by the matching list element, select the top 2 scores, and build a new list that holds the corresponding column names.
for example in the first row we have-
1*0.9=0.9 , 0*5=0 , 1*2=2
therefore the top 2 columns are a and c so we append them to a new list.
second row-
2*0.9=1.8, 3*5=15,1*2=2
therefore list=[a,c,b]
and so on...
third row-
4*0.9=3.6,5*5=25,1*2=2
so list remains unchanged [a,c,b]
so final output is [a,c,b]
If I understand you correctly, I think the previous answers are incomplete, so here is a solution. It uses numpy, which I hope is acceptable.
Create the weights:
n = [('a', 0.91), ('b', 5), ('c', 2)]
d = { a:b for a,b in n}
weights = [d[i] for i in df.columns]
Then we create a table with weights multiplied in:
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
df = df*weights
This yields:
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Then we can get top two indices for this in numpy:
b = np.argsort(df.values,axis=1)
b = b[:,-2:]
This yields:
array([[0, 2],
[2, 1],
[0, 1]], dtype=int64)
Finally we can calculate the order of appearance and give back column names:
c =b.reshape(-1)
_, idx = np.unique(c, return_index=True)
d = c[np.sort(idx)]
print(list(df.columns[d].values))
This yields:
['a', 'c', 'b']
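For comparison, the same idea can be sketched with a plain loop over rows, which may be easier to follow than the index arithmetic. I am assuming that within a row the two top columns are appended in ascending score order, which is what reproduces the expected output; the names scored and seen are made up.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 4], "b": [0, 3, 5], "c": [1, 1, 1]})
weights = pd.Series({"a": 0.91, "b": 5, "c": 2})

# multiplying by a Series aligns its index with the column labels
scored = df * weights

seen = []
for _, row in scored.iterrows():
    # nlargest gives the top 2 in descending order; reverse to ascending
    for col in row.nlargest(2).index[::-1]:
        if col not in seen:
            seen.append(col)
print(seen)  # ['a', 'c', 'b']
```

This is slower than the argsort version on large frames, since iterrows works one row at a time.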
Try this :
dict1 = {'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]} # arrays must all be same length
df = pd.DataFrame(dict1)
list1 = [('a', 0.91), ('b', 5), ('c', 2)]
df2 = pd.DataFrame({k : [j*v[1] for j in dict1[k]] for k in dict1 for v in list1 if k == v[0]})
"""
df2 should be like this :
a b c
0 0.91 0 2
1 1.82 15 2
2 3.64 25 2
"""
IIUC, you need:
a = [('a', 0.91), ('b', 5), ('c', 2)]
m= df.mul(pd.DataFrame(a).set_index(0)[1])
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Ranking each row, summing the ranks per column, then sorting and taking the index gives your desired output.
m.rank(axis=1,method='dense').sum().sort_values().index.tolist()
#['a', 'c', 'b']
I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer is for the participant at each position.
I want a column in my dataframe saying yes or no, depending on whether the letter 2 positions earlier was the same letter.
Here is my code :
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
letters
0 F
1 I
2 D
3 I
4 H
5 C
6 L
7 G
8 D
9 L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error :
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
When I run it with numbers I get no errors, but my new column response only contains 'no', even though I know it should be 'yes' 12 times.
It seems like you want to perform a comparison on the column and the same column shifted by two elements. Use shift + np.where -
df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
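To see why this works, here is a tiny sketch of what shift(2) produces on a hand-made series (the data is invented for illustration). The first two positions have nothing two steps back, so they become NaN and compare as unequal.

```python
import pandas as pd

letters = pd.Series(list("ABABCC"))

# shift(2) moves every value two positions down: [NaN, NaN, A, B, A, B]
shifted = letters.shift(2)

# element-wise comparison; NaN never equals a letter
match = letters.eq(shifted)
response = match.map({True: "yes", False: "no"})
print(pd.DataFrame({"letters": letters, "two_back": shifted, "response": response}))
```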
I have a dataframe with an index and multiple columns. I also have a few lists containing index values sampled on certain criteria. Now I want to create columns with labels based on whether or not the index of a given row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'
df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This works, but it is quite slow. Each call takes about 3 minutes, and since I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need double numpy.where with Index.isin :
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
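With more than two labels the nested np.where calls get hard to read. A sketch of the same labelling with np.select, which takes a list of conditions and a list of matching choices; the column name label and the two combined lists are placeholders for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(10)})
group_a = [0, 1, 7, 4]  # stands in for random_m + model_m
group_b = [2, 3, 5, 6]  # stands in for random_y + model_y

# the first condition that is True wins, like an if/elif chain
conditions = [df.index.isin(group_a), df.index.isin(group_b)]
df["label"] = np.select(conditions, ["A", "B"], default="not_assigned")
print(df)
```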
I want to count the number of times each value appears in a dataframe.
Here is my dataframe - df:
status
1 N
2 N
3 C
4 N
5 S
6 N
7 N
8 S
9 N
10 N
11 N
12 S
13 N
14 C
15 N
16 N
17 N
18 N
19 S
20 N
I want a dictionary of counts:
e.g. counts = {N: 14, C: 2, S: 4}
I have tried df['status']['N'], but it gives a KeyError, and also df['status'].value_counts, but to no avail.
You can use value_counts and to_dict:
print(df['status'].value_counts())
N 14
S 4
C 2
Name: status, dtype: int64
counts = df['status'].value_counts().to_dict()
print(counts)
{'N': 14, 'S': 4, 'C': 2}
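A self-contained version of the above that can be pasted and run as is, with the question's 20 status values rebuilt by hand from the table:

```python
import pandas as pd

# the 20 status values from the question, in order, indexed 1..20
status = list("NNCNSNNSNNNSNCNNNNSN")
df = pd.DataFrame({"status": status}, index=range(1, 21))

# value_counts returns a Series indexed by value; to_dict turns it into a plain dict
counts = df["status"].value_counts().to_dict()
print(counts)
```

Note that value_counts needs the parentheses: df['status'].value_counts without them is only a reference to the method, which is probably why the attempt in the question appeared not to work.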
An alternative one-liner using the often-overlooked Counter:
In [3]: from collections import Counter
In [4]: dict(Counter(df.status))
Out[4]: {'C': 2, 'N': 14, 'S': 4}
You can try this way.
df.stack().value_counts().to_dict()
Can you convert df into a list?
If so:
a = ['a', 'a', 'a', 'b', 'b', 'c']
c = dict()
for i in set(a):
    c[i] = a.count(i)
Using a dict comprehension:
c = {i: a.count(i) for i in set(a)}
See my response in this thread for a Pandas DataFrame output,
count the frequency that a value occurs in a dataframe column
For dictionary output, you can modify as follows:
def column_list_dict(x):
    column_list_df = []
    for col_name in x.columns:
        y = col_name, len(x[col_name].unique())
        column_list_df.append(y)
    return dict(column_list_df)