I've seen a lot of questions on how to convert pandas dataframes to nested dictionaries, but none of them deal with aggregating the information. I may even be able to do what I need within pandas, but I'm stuck.
Input
I have a dataframe that looks like this:
FeatureID gene Target pos bc_count
0 1_1_1 NRAS_3 TAGCAC 0 0.42
1 1_1_1 NRAS_3 TGCACA 1 1.00
2 1_1_1 NRAS_3 GCACAA 2 0.50
3 1_1_1 NRAS_3 CACAAA 3 2.00
4 1_1_1 NRAS_3 CAGAAA 3 0.42
# create df as below
import pandas as pd
df = pd.DataFrame([
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TAGCAC", "pos": 0, "bc_count": 0.42},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "TGCACA", "pos": 1, "bc_count": 1.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "GCACAA", "pos": 2, "bc_count": 0.50},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CACAAA", "pos": 3, "bc_count": 2.00},
    {"FeatureID": "1_1_1", "gene": "NRAS_3", "Target": "CAGAAA", "pos": 3, "bc_count": 0.42},
])
The problem
I need to break apart the Target column for each row into a list of (position, letter, count) tuples, where the starting position is given in the "pos" column, each subsequent letter of the string gets the next position, and the count is the value found for that row in the "bc_count" column.
For example, in the first row, the desired list of tuples would be:
[(0, "T", 0.42), (1,"A", 0.42), (2,"G", 0.42), (3,"C", 0.42), (4,"A", 0.42), (5,"C", 0.42)]
What I've tried
I've written code that breaks up the Target column into (position, nucleotide, count) tuples, with the count repeated for each letter, and adds them as a new column to the dataframe:
def index_target(row):
    # return (position, letter, count) tuples for this row's Target string
    return [(row.pos + x, y, row.bc_count) for x, y in enumerate(row.Target)]

df['pos_count'] = df.apply(index_target, axis=1)
This returns a list of tuples for each row, based on that row's Target column.
Now I need to take every row in df and, for each position across all targets, sum the counts, which is why I thought of using a dictionary as a counter:
position[letter] += bc_count
I've tried creating a defaultdict, but it is appending each list of tuples separately instead of summing the counts for each position:
from collections import defaultdict
d = defaultdict(dict) # also tried defaultdict(list) here
for x, y, z in row.pos_count:
    d[x][y] += z
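For reference, a minimal sketch of how such a nested counter could be made to sum the counts (assuming the pos_count column from above exists; the inner dict needs a numeric default so += works the first time a key is seen):
from collections import defaultdict

totals = defaultdict(lambda: defaultdict(float))  # position -> letter -> summed count
for pos_count in df['pos_count']:
    for position, letter, count in pos_count:
        totals[position][letter] += count

# rows = positions, columns = letters, missing combinations filled with 0
counts_df = pd.DataFrame(totals).T.fillna(0).sort_index()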
Desired Output
For each feature in the dataframe, I want a table like the one below, where the numbers represent the sum of the individual bc_count values for each letter at each position, and an X in the consensus marks positions where a tie was found and no single letter can be returned as the max:
pos A T G C
0 25 80 25 57
1 32 19 100 32
2 27 18 16 27
3 90 90 90 90
4 10 42 37 18
consensus= TGXXT
This may not be the most elegant solution, but I think it might accomplish what you need:
new_df = pd.DataFrame(
    df.apply(
        # this lambda is basically the same thing you're doing,
        # but we create a pd.Series with it
        lambda row: pd.Series(
            [(row.pos + i, c, row.bc_count) for i, c in enumerate(row.Target)]
        ),
        axis=1)
    .stack().tolist(),
    columns=["pos", "nucl", "count"]
)
Where new_df looks like this:
pos nucl count
0 0 T 0.42
1 1 A 0.42
2 2 G 0.42
3 3 C 0.42
4 4 A 0.42
5 5 C 0.42
6 1 T 1.00
7 2 G 1.00
8 3 C 1.00
9 4 A 1.00
Then I would pivot this to get the aggregated counts:
nucleotide_count_by_pos = new_df.pivot_table(
    index="pos",
    columns="nucl",
    values="count",
    aggfunc="sum",
    fill_value=0
)
Where nucleotide_count_by_pos looks like:
nucl A C G T
pos
0 0.00 0.00 0.00 0.42
1 0.42 0.00 0.00 1.00
2 0.00 0.00 1.92 0.00
3 0.00 4.34 0.00 0.00
4 4.34 0.00 0.00 0.00
And then to get the consensus:
def get_consensus(row):
    max_value = row.max()
    nuc = row.idxmax()
    if (row == max_value).sum() == 1:
        return nuc
    else:
        return "X"
consensus = ''.join(nucleotide_count_by_pos.apply(get_consensus, axis=1).tolist())
Which in the case of your example data would be:
'TTGCACAAA'
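If the real data contains more than one FeatureID, the same idea could be wrapped in a groupby. A rough sketch reusing get_consensus (the helper name consensus_for_feature is just for illustration):
def consensus_for_feature(sub):
    # expand each row's Target into (pos, letter, count) records for this feature
    expanded = pd.DataFrame(
        [(row.pos + i, c, row.bc_count)
         for row in sub.itertuples() for i, c in enumerate(row.Target)],
        columns=["pos", "nucl", "count"])
    counts = expanded.pivot_table(index="pos", columns="nucl",
                                  values="count", aggfunc="sum", fill_value=0)
    return "".join(counts.apply(get_consensus, axis=1))

consensus_by_feature = df.groupby("FeatureID").apply(consensus_for_feature)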
I'm not sure how to get your exact desired output, but I created the list d, which contains the tuples you described, and built a dataframe from it. Hopefully it provides some direction for what you want to create:
d = []
for t, c, p in zip(df.Target, df.bc_count, df.pos):
    # enumerate the string so each letter gets its own position, offset from p
    d.extend([(p + i, c, letter) for i, letter in enumerate(t)])

df_new = pd.DataFrame(d, columns=['pos', 'count', 'val'])
df_new = df_new.groupby(['pos', 'val']).agg({'count': 'sum'}).reset_index()
df_new.pivot(index='pos', columns='val', values='count')
Related
How to delete all the rows under the row whose column value is "Exercises" in pandas (Python)?
Data:
2021.08.16 19:37:15 146242975 XAUEUR buy 0.02 1 517.04 1 517.19 1 519.54 2021.08.16 20:38:30 1 517.15 - 0.12 0.00 0.22
2021.08.16 19:37:15 146242976 XAUEUR buy 0.02 1 517.04 1 517.19 1 522.04 2021.08.16 20:38:30 1 517.15 - 0.12 0.00 0.22
Exercises
2021.08.16 01:02:11 146037881 XAUUSD buy 0.18 / 0.18 market 1 777.72 1 781.47 2021.08.16 01:02:11 filled TP1
...
df = pd.DataFrame({'num': [1, 2, 3, 4, 'Exercises', 6, 7, 8]})

# First find the row index by filtering the column value
my_index = df.index[df['num'] == 'Exercises'].tolist()[0]  # there may be multiple matches, so take the first index with [0]
# my_index = 4

# Then slice the DataFrame and take the values into a new df
df_new = df[:my_index]  # this excludes the 'Exercises' row; use my_index + 1 if you want to keep it
I used the loc function.
df = pd.DataFrame({'col':['2021.08.16 19:37:15 146242975','2021.08.16 19:37:15 146242976','Exercises','2021.08.16 01:02:11 146037881'],'values':['a','b','c','d']})
df2 = df.set_index('col')
df2.loc[:'Exercises'][:-1].reset_index()
I have two pandas DataFrames with the same DateTime index.
The first one is J:
A B C
01/01/10 100 400 200
01/02/10 300 200 400
01/03/10 200 100 300
The second one is K:
100 200 300 400
01/01/10 0.05 -0.42 0.61 -0.12
01/02/10 -0.23 0.11 0.82 0.34
01/03/10 -0.55 0.24 -0.01 -0.73
I would like to use J to reference K and create a third DataFrame L that looks like:
A B C
01/01/10 0.05 -0.12 -0.42
01/02/10 0.82 0.11 0.34
01/03/10 0.24 -0.55 -0.01
To do so, I need to take each value in J and look up the corresponding value in K where the column name is that value for the same date.
I tried to do:
L = J.apply( lambda x: K.loc[ x.index, x ], axis='index' )
but get:
ValueError: If using all scalar values, you must pass an index
I would ideally like to use this so that any NaN values contained in J will remain as is, and will not be looked up in K. I had unsuccessfully tried this:
L = J.apply( lambda x: np.nan if np.isnan( x.astype( float ) ) else K.loc[ x.index, x ] )
Use DataFrame.melt and DataFrame.stack with DataFrame.join to map the new values, then return the DataFrame to its original shape with DataFrame.pivot:
# if necessary
# K = K.rename(columns=int)
L = (J.reset_index()
     .melt('index')
     .join(K.stack().rename('new_values'), on=['index', 'value'])
     .pivot(index='index',
            columns='variable',
            values='new_values')
     .rename_axis(columns=None, index=None)
)
print(L)
Or with DataFrame.lookup
L = J.reset_index().melt('index')
L['value'] = K.lookup(L['index'],L['value'])
L = L.pivot(*L).rename_axis(columns = None,index = None)
print(L)
Output
A B C
01/01/10 0.05 -0.12 -0.42
01/02/10 0.82 0.11 0.34
01/03/10 0.24 -0.55 -0.01
I think that apply could be a good option, but I'm not sure; I recommend you read When should I (not) want to use apply() in my code.
Use DataFrame.apply with DataFrame.lookup for label-based indexing.
# if needed, convert the columns of K to integers
# K.columns = K.columns.astype(int)
L = J.apply(lambda x: K.lookup(x.index, x))
A B C
01/01/10 0.05 -0.12 -0.42
01/02/10 0.82 0.11 0.34
01/03/10 0.24 -0.55 -0.01
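Note that DataFrame.lookup has since been deprecated and removed in recent pandas versions; an equivalent can be written with NumPy fancy indexing. A sketch, assuming K's columns are already integers, J contains no NaN values, and both frames share the same row order:
import numpy as np

# position of each J value among K's columns, and of each J row among K's rows
col_idx = K.columns.get_indexer(J.to_numpy().ravel())
row_idx = np.repeat(np.arange(len(J)), J.shape[1])

L = pd.DataFrame(K.to_numpy()[row_idx, col_idx].reshape(J.shape),
                 index=J.index, columns=J.columns)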
I have a column that contains 0s and dates like 12/02/19. I want to transform all dates into ones and multiply the result by the Enrolls_F column.
I would prefer using regex, but any other option is fine too. It is a large dataset; I tried a simple for loop and my kernel could not handle it.
Data:
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
Attempts:
Trying to search for everything that starts with 2, replace it with 1, and multiply by Enrolls_F:
df_test = (df.replace({'Enrolled_Date': r'2.$'}, {'Enrolled_Date': '1'}, regex=True)) * df.Enrolls_F
# Nothing happens
IIUC, this should help you get the trouble sorted:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'Enrolled_Date': ['0','2019/02/04','0','0','2019/02/04','0'] , 'Enrolls_F': ['1.11','1.11','0.222','1.11','5.22','1'] })
df['Enrolled_Date'] = np.where(df['Enrolled_Date'] == '0',0,1)
df['multiplication_column'] = df['Enrolled_Date'] * df['Enrolls_F']
print(df)
Output:
Enrolled_Date Enrolls_F multiplication_column
0 0 1.11
1 1 1.11 1.11
2 0 0.222
3 0 1.11
4 1 5.22 5.22
5 0 1
If you want the output as floats, try this:
df.Enrolled_Date.ne('0').astype(int) * df.Enrolls_F.astype(float)
Out[212]:
0 0.00
1 1.11
2 0.00
3 0.00
4 5.22
5 0.00
dtype: float64
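If you would rather keep the regex route mentioned in the question, a sketch along the same lines, starting from the original string column (str.match tests whether each value starts with 2, i.e. looks like one of the dates):
# True where the value starts with '2' (a date), False for '0'
is_date = df['Enrolled_Date'].str.match(r'^2')
df['multiplication_column'] = is_date.astype(int) * df['Enrolls_F'].astype(float)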
I have two dataframes. One has some probability brackets.
df1 = pd.DataFrame({'ProbabilityBrackets': [0, 0.50, 0.75, 1.0, 0.75, 0.90, 1.0, 0],
                    'Group': pd.Categorical(["test", "test", "test", "test", "train", "train", "train", "train"]),
                    'Destination': pd.Categorical(["-", "A", "B", "C", "AA", "BB", "CC", "-"])})
Destination Group ProbabilityBrackets
0 - test 0.00
1 A test 0.50
2 B test 0.75
3 C test 1.00
4 AA train 0.75
5 BB train 0.90
6 CC train 1.00
7 - train 0.00
The other dataframe has some random numbers and the group column.
df2 = pd.DataFrame({'randomnumbers': [0.2, 0.15, 0.78, 0.35],
                    'Group': pd.Categorical(["test", "train", "test", "train"])})
Group randomnumbers
0 test 0.20
1 train 0.15
2 test 0.78
3 train 0.35
Now I need to merge the two dataframes together, both by group and based on the probability brackets. Merging by group is trivial. The challenging requirement is merging based on ProbabilityBrackets and randomnumbers. A random number in df2 should be mapped to the smallest probability bracket that is larger than itself. E.g., test 0.2 in df2 is mapped to test 0.5 in df1, and test 0.78 in df2 is mapped to test 1.0 in df1.
I did it as follows, which works correctly:
for group in ['test', 'train']:
    brackets = df1[df1['Group'] == group].sort_values(by='ProbabilityBrackets')['ProbabilityBrackets'].unique()
    bracketlabels = brackets[1:]  # remove the first element of the list (e.g., remove 0 from (0, 0.5, 1))
    # assign random numbers to the brackets so that we can easily merge them with df1
    df2.loc[df2['Group'] == group, 'ProbabilityBrackets'] = pd.cut(df2['randomnumbers'], brackets, labels=bracketlabels)

df3 = df2.merge(df1, on=['Group', 'ProbabilityBrackets'], how='left')
It generates the following output, which is what I want, but it is slower than I'd like because I have thousands of groups in my dataset. Is there a faster, more pythonic way to do it?
Group randomnumbers ProbabilityBrackets Destination
0 test 0.20 0.50 A
1 train 0.15 0.75 AA
2 test 0.78 1.00 C
3 train 0.35 0.75 AA
You can try this.
# Step 1
df_m = df2.merge(df1, on="Group", how="outer")
# Step 2
df_m["diff"] = df_m["randomnumbers"] - df_m["ProbabilityBrackets"]
# Step 3
df_m_filtered = df_m[df_m["diff"] < 0].set_index(
["Destination", "ProbabilityBrackets"])
# Step 4
df_desired = df_m_filtered.groupby(
["Group", "randomnumbers"])["diff"].nlargest(1).reset_index()
index Group randomnumbers Destination ProbabilityBrackets diff
0 0 test 0.20 A 0.50 -0.30
1 1 test 0.78 C 1.00 -0.22
2 2 train 0.15 AA 0.75 -0.60
3 3 train 0.35 AA 0.75 -0.40
Explanation:
Begin with an outer merge
Calculate differences between randomnumbers and ProbabilityBrackets
Filter the results with the condition df_m["diff"] < 0, as we are interested in the rows whose randomnumbers is smaller than ProbabilityBrackets
Group by ["Group", "randomnumbers"] and find the row with the largest (closest to zero) diff, i.e. the nearest bracket above the random number, within each group.
Comparing “Group” for every element in df2 to every element in df1 is a lot of unnecessary string comparisons. You could instead try putting all the elements of df1 into a dictionary with Group as the key and having lists of (ProbabilityBrackets, Destination) tuples as the values. When inserting each element from df1, insert the tuple into the list maintaining the sort by ProbabilityBracket so that you don’t have to sort it again. Then you can retrieve the appropriate (ProbabilityBracket, Destination) for each element in df2 by looking in the dictionary by Group and performing a binary search on the list by ProbabilityBracket.
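A rough sketch of that idea using the standard-library bisect module (the names lookup and find_bracket are just for illustration, and the sketch assumes every random number falls below the largest bracket of its group):
from bisect import insort, bisect_right

# Group -> list of (ProbabilityBrackets, Destination), kept sorted as we insert
lookup = {}
for group, bracket, dest in zip(df1['Group'], df1['ProbabilityBrackets'], df1['Destination']):
    if bracket == 0:  # skip the placeholder rows
        continue
    insort(lookup.setdefault(group, []), (bracket, dest))

def find_bracket(group, number):
    """Return (bracket, destination) for the smallest bracket >= number."""
    candidates = lookup[group]
    i = bisect_right(candidates, (number,))  # first tuple whose bracket is >= number
    return candidates[i]

results = [find_bracket(g, r) for g, r in zip(df2['Group'], df2['randomnumbers'])]
df2['ProbabilityBrackets'] = [b for b, _ in results]
df2['Destination'] = [d for _, d in results]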
This is another way of doing it. Taking some cues from #JasonR.
Explanation:
- We create a dictionary of lists of (Destination, ProbabilityBrackets) tuples. This is done to avoid looping over df1 multiple times
- Next, we look up the dictionary key for each row in df2 and assign the result based on the given criteria.
from collections import defaultdict
# remove these rows
df1 = df1[df1['ProbabilityBrackets'] > 0]
df_dict = defaultdict(list)
# create a dictionary of tuples in list
for index, row in df1.iterrows():
    df_dict[row['Group']].append((row['Destination'], row['ProbabilityBrackets']))

## this calculates the output
for index, row in df2.iterrows():
    d = df_dict[row['Group']]
    randnum = row['randomnumbers']
    ## this finds the suitable probability bracket
    low = 10000
    tuple_ix = 10000
    for ix, (i, j) in enumerate(d):
        sub = j - randnum
        if sub > 0 and sub < low:
            low = sub
            tuple_ix = ix
    combination = d[tuple_ix]
    df2.loc[index, 'ProbabilityBracket'] = combination[1]
    df2.loc[index, 'Destination'] = combination[0]
Group randomnumbers ProbabilityBracket Destination
0 test 0.20 0.50 A
1 train 0.15 0.75 AA
2 test 0.78 1.00 C
3 train 0.35 0.75 AA
I'm trying to add float values like [[(1,0.44),(2,0.5),(3,0.1)],[(2,0.63),(1,0.85),(3,0.11)],[...]]
to a pandas dataframe that looks like a matrix built from the first value of each tuple:
df = 1 2 3
1 0.44 0.5 0.1
2 0.85 0.63 0.11
3 ... ... ...
I tried this:
for key, value in enumerate(outer_list):
    for tuplevalue in value:
        df.ix[key][tuplevalue[0]] = tuplevalue[1]
The problem is that my NxN matrix contains about 10000x10000 elements, and hence my approach takes really long. Is there another way to speed this up?
(Unfortunately the values in the list are not ordered by the first tuple element)
Use list comprehensions to first sort and extract your data. Then create your dataframe from the sorted and cleaned data.
data = [[(1, 0.44), (2, 0.50), (3, 0.10)],
[(2, 0.63), (1, 0.85), (3, 0.11)]]
# First, sort each row.
_ = [row.sort() for row in data]
# Then extract the second element of each tuple.
new_data = [[t[1] for t in row] for row in data]
# Now create a dataframe from your data.
>>> pd.DataFrame(new_data)
0 1 2
0 0.44 0.50 0.10
1 0.85 0.63 0.11
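For a 10000x10000 matrix, a NumPy-based variant may be noticeably faster than assigning cell by cell. A sketch, assuming the integer labels are 1-based as in the example (n_rows and n_cols are just illustrative names):
import numpy as np
import pandas as pd

data = [[(1, 0.44), (2, 0.5), (3, 0.1)],
        [(2, 0.63), (1, 0.85), (3, 0.11)]]

n_rows = len(data)
n_cols = max(label for row in data for label, _ in row)

# one vectorized assignment instead of a Python-level loop per cell
mat = np.full((n_rows, n_cols), np.nan)
rows = np.repeat(np.arange(n_rows), [len(row) for row in data])
cols = np.array([label for row in data for label, _ in row]) - 1  # labels are 1-based
vals = np.array([value for row in data for _, value in row])
mat[rows, cols] = vals

df = pd.DataFrame(mat, columns=range(1, n_cols + 1))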
This works using a dictionary (if you need to preserve your column order, or if the column names are strings). Maybe Alexander will update his answer to account for that; I'm nearly certain he'll have a better solution than my proposed one :)
Here's an example:
from collections import defaultdict
a = [[(1,0.44),(2,0.5),(3,0.1)],[(2,0.63),(1,0.85),(3,0.11)]]
b = [[('A',0.44),('B',0.5),('C',0.1)],[('B',0.63),('A',0.85),('C',0.11)]]
First on a:
row_to_dic = [{str(y[0]): y[1] for y in x} for x in a]
dd = defaultdict(list)
for d in row_to_dic:
    for key, value in d.items():
        dd[key].append(value)

pd.DataFrame.from_dict(dd)
1 2 3
0 0.44 0.50 0.10
1 0.85 0.63 0.11
and b:
row_to_dic = [{str(y[0]): y[1] for y in x} for x in b]
dd = defaultdict(list)
for d in row_to_dic:
    for key, value in d.items():
        dd[key].append(value)

pd.DataFrame.from_dict(dd)
A B C
0 0.44 0.50 0.10
1 0.85 0.63 0.11
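As a side note, the defaultdict step can be skipped by handing pandas one dict per row; a small sketch that should give the same result:
# each row becomes a {column label: value} dict; pandas aligns the keys into columns
pd.DataFrame([{str(label): value for label, value in row} for row in b])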