I have a dataframe that looks like this:
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster
1    1000   1005   6     7      13     Y           3                 6     36346
1    1007   10012  3     1      4      N           3                 6     36346
1    10014  10020  0     1      1      Y           3                 6     36346
2    33532  33554  1     1      2      N           1                 2     22123
cluster is an ID assigned to each row; in this example, cluster 36346 has 3 "sites".
In this cluster, two of the sites are in the control (in_control == Y).
I want to create an additional column that tells me what proportion of the sites are in the control, i.e. (sum(in_control == Y) for a cluster) / sites_in_cluster.
In this example, cluster 36346 has two rows with in_control == Y and sites_in_cluster == 3, so cluster_sites_in_control would be 2/3 = 0.66, whereas cluster 22123 has only one site, which isn't in the control, so it would be 0/1 = 0.
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster  cluster_sites_in_control
1    1000   1005   6     7      13     Y           3                 6     36346    0.66
1    1007   10012  3     1      4      N           3                 6     36346    0.66
1    10014  10020  0     1      1      Y           3                 6     36346    0.66
2    33532  33554  1     1      2      N           1                 2     22123    0.00
I have written code which seemingly accomplishes this; however, it seems extremely roundabout and I'm certain there's a better solution out there:
intersect_in_control
# %%
import pandas as pd
#get the number of sites in a control that are 'Y'
number_in_control = pd.DataFrame(intersect_in_control.groupby(['cluster']).in_control.value_counts().unstack(fill_value=0).loc[:,'Y'])
#get the number of breaksites for that cluster
number_of_breaksites = pd.DataFrame(intersect_in_control.groupby(['cluster'])['sites_in_cluster'].count())
#combine these two dataframes
combined_dataframe = pd.concat([number_in_control.reset_index(drop=False), number_of_breaksites.reset_index(drop=True)], axis=1)
#calculate the desired column
combined_dataframe["proportion_in_control"] = combined_dataframe["Y"]/combined_dataframe["sites_in_cluster"]
#left join this new dataframe to the original whilst dropping undesired columns.
cluster_in_control = intersect_in_control.merge((combined_dataframe.drop(["Y","sites_in_cluster"], axis = 1)), on='cluster', how='left')
11 rows of the df as example data:
{'chr': {0: 'chr14',
1: 'chr2',
2: 'chr1',
3: 'chr10',
4: 'chr17',
5: 'chr17',
6: 'chr2',
7: 'chr2',
8: 'chr2',
9: 'chr1',
10: 'chr1'},
'start': {0: 23016497,
1: 133031338,
2: 64081726,
3: 28671025,
4: 45219225,
5: 45219225,
6: 133026750,
7: 133026761,
8: 133026769,
9: 1510391,
10: 15853061},
'end': {0: 23016501,
1: 133031342,
2: 64081732,
3: 28671030,
4: 45219234,
5: 45219234,
6: 133026755,
7: 133026763,
8: 133026770,
9: 1510395,
10: 15853067},
'plus_count': {0: 2,
1: 0,
2: 5,
3: 1,
4: 6,
5: 6,
6: 14,
7: 2,
8: 0,
9: 2,
10: 4},
'minus_count': {0: 6,
1: 7,
2: 1,
3: 5,
4: 0,
5: 0,
6: 0,
7: 0,
8: 2,
9: 3,
10: 1},
'count': {0: 8, 1: 7, 2: 6, 3: 6, 4: 6, 5: 6, 6: 14, 7: 2, 8: 2, 9: 5, 10: 5},
'in_control': {0: 'N',
1: 'N',
2: 'Y',
3: 'N',
4: 'Y',
5: 'Y',
6: 'N',
7: 'Y',
8: 'N',
9: 'Y',
10: 'Y'},
'total_breaks': {0: 8,
1: 7,
2: 6,
3: 6,
4: 6,
5: 6,
6: 18,
7: 18,
8: 18,
9: 5,
10: 5},
'sites_in_cluster': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 3,
7: 3,
8: 3,
9: 1,
10: 1},
'mean_breaks_per_site': {0: 8.0,
1: 7.0,
2: 6.0,
3: 6.0,
4: 6.0,
5: 6.0,
6: 6.0,
7: 6.0,
8: 6.0,
9: 5.0,
10: 5.0},
'cluster': {0: 22665,
1: 24664,
2: 3484,
3: 13818,
4: 23640,
5: 23640,
6: 24652,
7: 24652,
8: 24652,
9: 48,
10: 769}}
Thanks in advance for any help :)
For the percentage, you can simplify the solution by taking the mean of a boolean column per group; to create the new column, use GroupBy.transform. This works because True values are treated as 1:
df['cluster_sites_in_control'] = (df['in_control'].eq('Y')
.groupby(df['cluster']).transform('mean'))
print (df)
chr start end plus minus total in_control sites_in_cluster mean \
0 1 1000 1005 6 7 13 Y 3 6
1 1 1007 10012 3 1 4 N 3 6
2 1 10014 10020 0 1 1 Y 3 6
3 2 33532 33554 1 1 2 N 1 2
cluster cluster_sites_in_control
0 36346 0.666667
1 36346 0.666667
2 36346 0.666667
3 22123 0.000000
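As a cross-check against the sites_in_cluster column the frame already carries, the same proportion can also be computed by transforming the per-cluster count of 'Y' rows and dividing. A small sketch along those lines, assuming sites_in_cluster is consistent with the cluster grouping (the _check column name is just for illustration):

# per-cluster count of in_control == 'Y', aligned back to the original rows
yes_per_cluster = df['in_control'].eq('Y').groupby(df['cluster']).transform('sum')
df['cluster_sites_in_control_check'] = yes_per_cluster / df['sites_in_cluster']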
Given a pandas data frame like the following, where the column names are the times, the rows are the subjects, and the values are probabilities, return the column name (or time) of the first time the probability is less than .50 for each subject. The probabilities always descend from 1 to 0. I have tried looping through the data frame, but it is not computationally efficient.
subject id  0  1         2         3         4         5         6         7         …  669       670       671
1           1  0.997913  0.993116  0.989017  0.976157  0.973078  0.968056  0.963685  …  0.156092  0.156092  0.156092
2           1  0.990335  0.988685  0.983145  0.964912  0.958000  0.952000  0.946995  …  0.148434  0.148434  0.148434
3           1  0.996231  0.990571  0.985775  0.976809  0.972736  0.969633  0.966116  …  0.170370  0.170370  0.170370
4           1  0.997129  0.994417  0.991054  0.978795  0.974216  0.968060  0.963039  …  0.151920  0.151920  0.151920
5           1  0.997728  0.993598  0.986641  0.982460  0.977371  0.972874  0.968160  …  0.154545  0.154545  0.154545
6           1  0.998134  0.995564  0.989901  0.986941  0.982313  0.972951  0.969645  …  0.174730  0.174730  0.174730
7           1  0.995681  0.994131  0.990401  0.974494  0.967941  0.961859  0.956636  …  0.144753  0.144753  0.144753
8           1  0.997541  0.994904  0.991941  0.983389  0.979375  0.973158  0.966358  …  0.158763  0.158763  0.158763
9           1  0.992253  0.989064  0.979258  0.955747  0.948842  0.942899  0.935784  …  0.150291  0.150291  0.150291
Goal Output

subject id  time prob < .50
1           100
2           99
3           34
4           19
5           600
6           500
7           222
8           111
9           332
Since the probabilities are always descending, you can do this:
>>> df.set_index("subject id").gt(.98).sum(1)
subject id
1 4
2 4
3 4
4 4
5 5
6 6
7 4
8 5
9 3
dtype: int64
note: I'm using .98 instead of .5 because I'm using only a portion of the data.
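If the actual column name (the time) is wanted rather than the count, the count can be used as a positional index into the columns. A small sketch along those lines, assuming the time columns are in chronological order:

probs = df.set_index("subject id")
counts = probs.gt(.98).sum(axis=1)   # number of time points still above the threshold
cols = probs.columns
# each count is the positional index of the first column that drops below the threshold
first_below = counts.map(lambda i: cols[i] if i < len(cols) else None)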
Data used
{'subject id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
'0': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
'1': {0: 0.997913,
1: 0.990335,
2: 0.996231,
3: 0.997129,
4: 0.997728,
5: 0.998134,
6: 0.995681,
7: 0.997541,
8: 0.992253},
'2': {0: 0.993116,
1: 0.988685,
2: 0.990571,
3: 0.994417,
4: 0.993598,
5: 0.995564,
6: 0.994131,
7: 0.994904,
8: 0.989064},
'3': {0: 0.989017,
1: 0.983145,
2: 0.985775,
3: 0.991054,
4: 0.986641,
5: 0.989901,
6: 0.990401,
7: 0.991941,
8: 0.979258},
'4': {0: 0.976157,
1: 0.964912,
2: 0.976809,
3: 0.978795,
4: 0.98246,
5: 0.986941,
6: 0.974494,
7: 0.983389,
8: 0.955747},
'5': {0: 0.973078,
1: 0.958,
2: 0.972736,
3: 0.974216,
4: 0.977371,
5: 0.982313,
6: 0.967941,
7: 0.979375,
8: 0.948842},
'6': {0: 0.968056,
1: 0.952,
2: 0.969633,
3: 0.96806,
4: 0.972874,
5: 0.972951,
6: 0.961859,
7: 0.973158,
8: 0.942899},
'7': {0: 0.963685,
1: 0.946995,
2: 0.966116,
3: 0.963039,
4: 0.96816,
5: 0.969645,
6: 0.956636,
7: 0.966358,
8: 0.935784}}
If I understand correctly, I think this is what you are looking for:
df.where(df.lt(.5)).idxmax(axis=1)
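For reference, applied to the sample data above (again with .98 instead of .5, since the sample only covers the first few time points) and with "subject id" moved into the index, this would look roughly like:

probs = df.set_index("subject id")
# keep only values below the threshold; with descending probabilities, the largest
# remaining value in each row is the first one below it, so idxmax returns its column
first_below = probs.where(probs.lt(.98)).idxmax(axis=1)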
I do not know how to do this without four nested for loops.
I'd like to apply a custom function to every possible combination of hour and day subsets, return that value, and then pivot the data frame into a square matrix. However, these for loops seem excessive, so I'm looking for a more efficient way to do this. The data I have is fairly large, so any gain in speed would be beneficial.
edit: I updated the question to include a custom function.
Here is an example,
Sample data
import pandas as pd
import numpy as np
dat = pd.DataFrame({'day': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 2, 11: 2, 12: 2, 13: 2, 14: 2, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'hour': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 2, 16: 2, 17: 2, 18: 2, 19: 2}, 'distance': {0: 1.2898851269657656, 1: 0.0, 2: 0.8371526423804061, 3: 0.8703856587273138, 4: 0.6257425922449789, 5: 0.0, 6: 0.0, 7: 0.0, 8: 1.2895328696587023, 9: 0.0, 10: 0.6875527848294374, 11: 0.0, 12: 0.0, 13: 0.9009031833559706, 14: 0.0, 15: 1.1040652963428623, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0}})
Code
def custom_fn(x, y):
    x = pd.Series(x)
    y = pd.Series(y)
    x = x**2
    y = np.sqrt(y)
    return x.sum() - y.sum()
# Empty data frame to collect results
dmat = pd.DataFrame()
# i, j iterate over hours; k, l iterate over days
for i in range(1, 3):
    for j in range(1, 3):
        for k in range(1, 3):
            for l in range(1, 3):
                x = dat[(dat['hour'] == i) & (dat['day'] == k)].distance
                y = dat[(dat['hour'] == j) & (dat['day'] == l)].distance
                # Calculate difference
                jds = custom_fn(x, y)
                # Build a one-row frame and append it (pd.concat, since DataFrame.append
                # was removed in pandas 2.0)
                outdat = pd.DataFrame({'day_hour_a': f"{k}_{i}", 'day_hour_b': f"{l}_{j}", 'jds': [round(jds, 4)]})
                dmat = pd.concat([dmat, outdat], ignore_index=True)
# Pivot data to get matrix
distMatrix = dmat.pivot(index='day_hour_a', columns='day_hour_b', values='jds')
Output
> print(distMatrix)
day_hour_b 1_1 1_2 2_1 2_2
day_hour_a
1_1 -0.2609 2.3782 1.7354 2.4630
1_2 -2.1118 0.5273 -0.1155 0.6121
2_1 -2.4903 0.1488 -0.4940 0.2336
2_2 -2.5557 0.0834 -0.5594 0.1682
If I understand correctly, what you're doing is the same as the following:
def f(x):
    return x.mean()
x = df.groupby(['day', 'hour'])['distance'].apply(f)
x = x.values[:,None] - x.values
print(x)
Output:
[[ 0. 0.46672663 0.40694201 0.50382014]
[-0.46672663 0. -0.05978462 0.03709351]
[-0.40694201 0.05978462 0. 0.09687813]
[-0.50382014 -0.03709351 -0.09687813 0. ]]
Update: For the updated custom function you can still break it down into separate groupbys:
g = df.groupby(['day', 'hour'])['distance']
x = g.apply(lambda z: (z**2).sum())
y = g.apply(lambda z: np.sqrt(z).sum())
x.values[:,None] - y.values
Output:
array([[-0.26092193, 2.37817717, 1.73540595, 2.46300806],
[-2.11178008, 0.52731901, -0.1154522 , 0.61214991],
[-2.49031973, 0.14877937, -0.49399185, 0.23361026],
[-2.55571493, 0.08338417, -0.55938705, 0.16821506]])
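If the labelled day_hour matrix from the question is wanted rather than a bare array, the broadcast result can be wrapped back into a DataFrame. A small sketch (names here are illustrative, and it assumes the group keys come out in the same (day, hour) order for both series):

labels = [f"{day}_{hour}" for day, hour in x.index]
dist_matrix = pd.DataFrame(x.values[:, None] - y.values, index=labels, columns=labels)
print(dist_matrix)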
Update 2: If calculations can't be separated, another alternative is:
from scipy.spatial import distance

def f(x, y):
    return distance.jensenshannon(x, y)
x = []
g = df.groupby(['day', 'hour'])['distance']
for k1, g1 in g:
    for k2, g2 in g:
        x += [(k1, k2, f(g1, g2))]
x = pd.DataFrame(x).pivot(index=0, columns=1, values=2)
print(x)
Output:
1 (1, 1) (1, 2) (2, 1) (2, 2)
0
(1, 1) 0.000000 0.623167 0.419371 0.550291
(1, 2) 0.623167 0.000000 0.424608 0.832555
(2, 1) 0.419371 0.424608 0.000000 0.504233
(2, 2) 0.550291 0.832555 0.504233 0.000000
Here's the code:
ValuesDict = {}
PathwayTemplate = {}
GenerationDictionary = {0:[]}
StartNumbers = [2,3,5,7]
for Number in StartNumbers:
    PathwayTemplate.update({Number: 0})
for Number in StartNumbers:
    ValuesDict.update({Number: [PathwayTemplate.copy()]})
    ValuesDict[Number][0][Number] += 1
    GenerationDictionary[0].append(Number)
Fin = 15
Gen = 0
while min(GenerationDictionary[Gen]) <= Fin:
    Gen += 1
    GenerationDictionary.update({Gen: []})
    for Number in GenerationDictionary[Gen-1]:
        for StartNumber in StartNumbers:
            if (Number+StartNumber) in ValuesDict.keys():
                pass
            else:
                ValuesDict.update({(Number+StartNumber): list(ValuesDict[Number])})
                for subpath in ValuesDict[(Number+StartNumber)]:
                    subpath[StartNumber] += 1
                GenerationDictionary[Gen].append((Number+StartNumber))
                print((Number+StartNumber), ValuesDict[(Number+StartNumber)])
    print()
And under first iteration it outputs:
4 [{2: 2, 3: 0, 5: 0, 7: 0}]
9 [{2: 2, 3: 0, 5: 0, 7: 1}]
6 [{2: 0, 3: 2, 5: 0, 7: 0}]
8 [{2: 0, 3: 2, 5: 1, 7: 0}]
10 [{2: 0, 3: 2, 5: 1, 7: 1}]
12 [{2: 0, 3: 0, 5: 1, 7: 1}]
14 [{2: 0, 3: 0, 5: 0, 7: 2}]
Where I expected it to give me one 2 and one 7 under 9.
It's meant to add numbers together and store which numbers were added to reach each result.
I tried reading about Python 3 saving memory by making a pointer-style link when a = b, so that when a changes, so does b, but it would appear the objects are linked here despite being re-cast with list(). I've also tried checking a is b and it gave me False; however, it clearly still auto-updates my original ValuesDict[Number].
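For reference, a minimal sketch of the behaviour being described: list() creates a new outer list (so a is b is False), but the dictionaries inside it are still the same objects, so mutating them shows through; copy.deepcopy would copy the inner dictionaries as well:

import copy

template = {2: 0, 3: 0}
paths = [template.copy()]

shallow = list(paths)        # new outer list, same inner dict objects
print(shallow is paths)      # False
shallow[0][2] += 1
print(paths[0])              # {2: 1, 3: 0} -- the original dict was mutated too

deep = copy.deepcopy(paths)  # copies the inner dicts as well
deep[0][2] += 1
print(paths[0])              # still {2: 1, 3: 0}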
I have a dataset as shown below:
I want to drop rows like 4, 5 and 7, where the majority of the columns (but not all of them) are 0. At the same time, I don't want to drop rows like 0 and 1, which have only a few 0 entries.
First, create a column counting the zeros in each row:
df['no_of_zeros'] = (df == 0).astype(int).sum(axis=1)
Then decide how many zeros are acceptable in a row and filter the dataframe accordingly:
df = df[df['no_of_zeros'] < 3].drop(['no_of_zeros'], axis=1)
Here is one way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 4],
[0, 0, 0, 1, 2]],
columns=['A', 'B', 'C', 'D', 'E'])
df = df[~((df == 0).astype(int).sum(axis=1) > len(df.columns) / 2)]
# A B C D E
# 0 0 1 2 3 4
Assuming "majority" means "more than half of the columns", this works:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'c2': {0: 76, 1: 45, 2: 47, 3: 92, 4: 0, 5: 0, 6: 26, 7: 0, 8: 71},
...: 'c3': {0: 0, 1: 3, 2: 6, 3: 9, 4: 0, 5: 0, 6: 12, 7: 0, 8: 15},
...: 'c4': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
...: 'c5': {0: 23, 1: 0, 2: 23, 3: 23, 4: 0, 5: 0, 6: 23, 7: 0, 8: 23},
...: 'c6': {0: 65, 1: 25, 2: 62, 3: 26, 4: 52, 5: 22, 6: 65, 7: 0, 8: 69},
...: 'c7': {0: 12, 1: 12, 2: 12, 3: 12, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12},
...: 'c8': {0: 0, 1: 0, 2: 8, 3: 9, 4: 0, 5: 0, 6: 4, 7: 0, 8: 4},
...: 'cl': {0: 5, 1: 7, 2: 8, 3: 15, 4: 0, 5: 0, 6: 2, 7: 0, 8: 5},
...: 'sr': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}})
...:
In [3]: df
Out[3]:
c2 c3 c4 c5 c6 c7 c8 cl sr
0 76 0 1 23 65 12 0 5 0
1 45 3 1 0 25 12 0 7 1
2 47 6 1 23 62 12 8 8 2
3 92 9 1 23 26 12 9 15 3
4 0 0 1 0 52 12 0 0 4
5 0 0 1 0 22 12 0 0 5
6 26 12 1 23 65 12 4 2 6
7 0 0 1 0 0 12 0 0 7
8 71 15 1 23 69 12 4 5 8
In [4]: df[((df == 0).sum(axis=1) <= len(df.columns) / 2)]
Out[4]:
c2 c3 c4 c5 c6 c7 c8 cl sr
0 76 0 1 23 65 12 0 5 0
1 45 3 1 0 25 12 0 7 1
2 47 6 1 23 62 12 8 8 2
3 92 9 1 23 26 12 9 15 3
6 26 12 1 23 65 12 4 2 6
8 71 15 1 23 69 12 4 5 8
I have the following dataset:
data = {'VALVE_SCORE': {0: 34.1,1: 41.0,2: 49.7,3: 53.8,4: 35.8,5: 49.2,6: 38.6,7: 51.2,8: 44.8,9: 51.5,10: 41.9,11: 46.0,12: 41.9,13: 51.4,14: 35.0,15: 49.7,16: 41.5,17: 51.5,18: 45.2,19: 53.4,20: 38.1,21: 50.2,22: 25.4,23: 30.0,24: 28.1,25: 49.9,26: 27.5,27: 37.2,28: 27.7,29: 45.7,30: 27.2,31: 30.0,32: 27.9,33: 34.3,34: 29.5,35: 34.5,36: 28.0,37: 33.6,38: 26.8,39: 31.8},
'DAY': {0: 6, 1: 6, 2: 6, 3: 6, 4: 13, 5: 13, 6: 13, 7: 13, 8: 20, 9: 20, 10: 20, 11: 20, 12: 27, 13: 27, 14: 27, 15: 27, 16: 3, 17: 3, 18: 3, 19: 3, 20: 10, 21: 10, 22: 10, 23: 10, 24: 17, 25: 17, 26: 17, 27: 17, 28: 24, 29: 24, 30: 24, 31: 24, 32: 3, 33: 3, 34: 3, 35: 3, 36: 10, 37: 10, 38: 10, 39: 10},
'MONTH': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 3, 33: 3, 34: 3, 35: 3, 36: 3, 37: 3, 38: 3, 39: 3}}
df = pd.DataFrame(data)
First, I would like to take the mean by day and then by month. However, taking the mean by grouping the days results in decimal months. I would like to preserve the months before I do a groupby('MONTH').mean()
In [401]: df.groupby("DAY").mean()
Out[401]:
VALVE_SCORE MONTH
DAY
3 39.7250 2.5
6 44.6500 1.0
10 32.9875 2.5
13 43.7000 1.0
17 35.6750 2.0
20 46.0500 1.0
24 32.6500 2.0
27 44.5000 1.0
I would like the end result to be:
MONTH VALVE_SCORE
1 value
2 value
3 value
With the data you have, you want the daily mean and then the monthly mean. Putting the same data into an Excel pivot table gives a result like this:
Doing the same in pandas, grouping by month is enough to get the same result:
df.groupby(['MONTH']).mean()
DAY VALVE_SCORE
MONTH
1 16.5 44.7250
2 13.5 38.0375
3 6.5 30.8000
Since the day values are numeric, pandas includes them in the mean; if the 'DAY' column held strings rather than numbers, it would be excluded and you would get this result:
VALVE_SCORE
MONTH
1 44.7250
2 38.0375
3 30.8000
So pandas already computes the daily means and uses them to compute the monthly means.
Here's a possible solution. Do let me know if there is a more efficient way of doing it.
df = pd.DataFrame(data)
months = list(df['MONTH'].unique())
frames = []
for p in months:
    df_part = df[df['MONTH'] == p]
    df_part_avg = df_part.groupby("DAY", as_index=False).mean()
    df_part_avg = df_part_avg.drop('DAY', axis=1)
    frames.append(df_part_avg)
df_months = pd.concat(frames)
df_final = df_months.groupby("MONTH", as_index=False).mean()
And the result is:
In [430]: df_final
Out[430]:
MONTH VALVE_SCORE
0 1 44.7250
1 2 38.0375
2 3 30.8000
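A more concise way to express the same two-step idea is to chain two groupby calls: first average per (MONTH, DAY), then average those daily means per month. A sketch along those lines (not a claim about which is faster):

# daily means first, keeping MONTH as a grouping key so it isn't averaged away
daily = df.groupby(['MONTH', 'DAY'], as_index=False)['VALVE_SCORE'].mean()
# then the mean of the daily means per month
monthly = daily.groupby('MONTH', as_index=False)['VALVE_SCORE'].mean()
print(monthly)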