Neat inter-row calculations within pandas DataFrame with MultiIndex - python

I have the classic UCB admissions dataset as a pandas DataFrame with MultiIndex:
                      value
Dept Gender Admit
A    Male   Admitted    512
            Rejected    313
     Female Admitted     89
            Rejected     19
... and so on for the other departments ('B' through 'F')
and I want to create a table of the ratio of students accepted to rejected, grouped by Dept and Gender.
My current approaches have been:
ucbA.groupby(level=['Dept', 'Gender']).apply(lambda x: x.xs('Admitted', level=2).iloc[0] / x.xs('Rejected', level=2).iloc[0]).unstack().value
which is horrible
and
admitted = ucbA.unstack('Admit')
DataFrame({'Proportion Accepted': admitted.value.Admitted / admitted.value.Rejected}).unstack(1)
which is OK, I guess, but I feel it should be possible as a one-liner without unstacking.
Is there a really neat way of doing something like this? I'm imagining a one-liner staying within the context of the multi-index.
Edit: The full frame:
DataFrame({'Admit': {0: 'Admitted', 1: 'Rejected', 2: 'Admitted', 3: 'Rejected', 4: 'Admitted', 5: 'Rejected', 6: 'Admitted', 7: 'Rejected', 8: 'Admitted', 9: 'Rejected', 10: 'Admitted', 11: 'Rejected', 12: 'Admitted', 13: 'Rejected', 14: 'Admitted', 15: 'Rejected', 16: 'Admitted', 17: 'Rejected', 18: 'Admitted', 19: 'Rejected', 20: 'Admitted', 21: 'Rejected', 22: 'Admitted', 23: 'Rejected'}, 'Dept': {0: 'A', 1: 'A', 2: 'A', 3: 'A', 4: 'B', 5: 'B', 6: 'B', 7: 'B', 8: 'C', 9: 'C', 10: 'C', 11: 'C', 12: 'D', 13: 'D', 14: 'D', 15: 'D', 16: 'E', 17: 'E', 18: 'E', 19: 'E', 20: 'F', 21: 'F', 22: 'F', 23: 'F'}, 'Gender': {0: 'Male', 1: 'Male', 2: 'Female', 3: 'Female', 4: 'Male', 5: 'Male', 6: 'Female', 7: 'Female', 8: 'Male', 9: 'Male', 10: 'Female', 11: 'Female', 12: 'Male', 13: 'Male', 14: 'Female', 15: 'Female', 16: 'Male', 17: 'Male', 18: 'Female', 19: 'Female', 20: 'Male', 21: 'Male', 22: 'Female', 23: 'Female'}, 'value': {0: 512, 1: 313, 2: 89, 3: 19, 4: 353, 5: 207, 6: 17, 7: 8, 8: 120, 9: 205, 10: 202, 11: 391, 12: 138, 13: 279, 14: 131, 15: 244, 16: 53, 17: 138, 18: 94, 19: 299, 20: 22, 21: 351, 22: 24, 23: 317}}).set_index(['Dept', 'Gender', 'Admit']).astype(float).astype(int)
Alternatively, if you have rpy:
import pandas.rpy.common as com
ucbA = com.load_data('UCBAdmissions').set_index(['Dept', 'Gender', 'Admit']).astype(float).astype(int)
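Note that pandas.rpy was removed from pandas long ago. If you'd rather not paste the frame by hand, one hedged alternative (this assumes statsmodels, network access, and that the Rdatasets flattening of the 3-way table exposes Admit/Gender/Dept/Freq columns) is:
import statsmodels.api as sm

# fetched over the network from the Rdatasets collection
ucb = sm.datasets.get_rdataset('UCBAdmissions').data
ucbA = (ucb.rename(columns={'Freq': 'value'})
           .set_index(['Dept', 'Gender', 'Admit'])
           .astype(int))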

Here you go:
import pandas as pd

df = pd.DataFrame({'Dept': ['A', 'A', 'A', 'A'],
                   'Gender': ['Male', 'Male', 'Female', 'Female'],
                   'Admit': ['Admitted', 'Rejected', 'Admitted', 'Rejected'],
                   'value': [512, 313, 89, 19]})
df = df.set_index(['Dept', 'Gender', 'Admit'])
# Proportions accepted and rejected:
df / df.groupby(level=['Dept', 'Gender']).transform('sum')
#                          value
# Dept Gender Admit
# A    Female Admitted  0.824074
#             Rejected  0.175926
#      Male   Admitted  0.620606
#             Rejected  0.379394
# If you really want admitted as a fraction of rejected:
df2 = df.swaplevel(1, 2).swaplevel(0, 1)
df2.loc['Admitted'] / df2.loc['Rejected']
#                 value
# Dept Gender
# A    Male    1.635783
#      Female  4.684211
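A follow-up sketch: since the index levels are named, .xs gives the same ratio in one line, with no swaplevel and no long-deprecated .ix, because the two cross-sections align on the remaining (Dept, Gender) levels when divided:
df.xs('Admitted', level='Admit') / df.xs('Rejected', level='Admit')
#                 value
# Dept Gender
# A    Male    1.635783
#      Female  4.684211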

Here's a way
In [55]: grouper = ['Dept','Gender']
In [56]: x = df.reset_index()
In [57]: (x[x.Admit=='Admitted'].groupby(grouper).sum() /
   ....:  x[x.Admit=='Rejected'].groupby(grouper).sum()
   ....: ).unstack()
Out[57]:
           value
Gender    Female      Male
Dept
A       4.684211  1.635783
B       2.125000  1.705314
C       0.516624  0.585366
D       0.536885  0.494624
E       0.314381  0.384058
F       0.075710  0.062678

[6 rows x 2 columns]
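A compact variant of the same idea in a single expression (a sketch assuming df is the full six-department frame from the question; eval does the division on the unstacked Admit columns):
df['value'].unstack('Admit').eval('Admitted / Rejected').unstack('Gender')
This should reproduce the table above, minus the extra 'value' column level.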

Related

How to put values on a single row from multiple columns in Pandas

I have been scratching my head for days about this problem. Please, find below the structure of my input data and the output that I want.
I color-coded per ID, Plot, Survey, Trial and the 3 estimation methods.
In the output, I want to get all the scorings for each group, which are represented by color, on the same row. By doing that, we should get rid of the Estimation Method column in the output. I kept it for the sake of clarity.
This is my code. Thank you in advance for your time.
import pandas as pd
import numpy as np  # needed for the np.nan values in the dict below
import functools
data_dict = {'ID': {0: 'id1',
1: 'id1',
2: 'id1',
3: 'id1',
4: 'id1',
5: 'id1',
6: 'id1',
7: 'id1',
8: 'id1',
9: 'id1',
10: 'id1',
11: 'id1',
12: 'id1',
13: 'id1',
14: 'id1',
15: 'id1',
16: 'id1',
17: 'id1',
18: 'id1',
19: 'id1',
20: 'id1',
21: 'id1',
22: 'id1',
23: 'id1'},
'Plot': {0: 'p1',
1: 'p1',
2: 'p1',
3: 'p1',
4: 'p1',
5: 'p1',
6: 'p1',
7: 'p1',
8: 'p1',
9: 'p1',
10: 'p1',
11: 'p1',
12: 'p1',
13: 'p1',
14: 'p1',
15: 'p1',
16: 'p1',
17: 'p1',
18: 'p1',
19: 'p1',
20: 'p1',
21: 'p1',
22: 'p1',
23: 'p1'},
'Survey': {0: 'Sv1',
1: 'Sv1',
2: 'Sv1',
3: 'Sv1',
4: 'Sv1',
5: 'Sv1',
6: 'Sv2',
7: 'Sv2',
8: 'Sv2',
9: 'Sv2',
10: 'Sv2',
11: 'Sv2',
12: 'Sv1',
13: 'Sv1',
14: 'Sv1',
15: 'Sv1',
16: 'Sv1',
17: 'Sv1',
18: 'Sv2',
19: 'Sv2',
20: 'Sv2',
21: 'Sv2',
22: 'Sv2',
23: 'Sv2'},
'Trial': {0: 't1',
1: 't1',
2: 't1',
3: 't2',
4: 't2',
5: 't2',
6: 't1',
7: 't1',
8: 't1',
9: 't2',
10: 't2',
11: 't2',
12: 't1',
13: 't1',
14: 't1',
15: 't2',
16: 't2',
17: 't2',
18: 't1',
19: 't1',
20: 't1',
21: 't2',
22: 't2',
23: 't2'},
'Mission': {0: 'mission1',
1: 'mission1',
2: 'mission1',
3: 'mission1',
4: 'mission1',
5: 'mission1',
6: 'mission1',
7: 'mission1',
8: 'mission1',
9: 'mission1',
10: 'mission1',
11: 'mission2',
12: 'mission2',
13: 'mission2',
14: 'mission2',
15: 'mission2',
16: 'mission2',
17: 'mission2',
18: 'mission2',
19: 'mission2',
20: 'mission2',
21: 'mission2',
22: 'mission2',
23: 'mission2'},
'Estimation Method': {0: 'MCARI2',
1: 'NDVI',
2: 'NDRE',
3: 'MCARI2',
4: 'NDVI',
5: 'NDRE',
6: 'MCARI2',
7: 'NDVI',
8: 'NDRE',
9: 'MCARI2',
10: 'NDVI',
11: 'NDRE',
12: 'MCARI2',
13: 'NDVI',
14: 'NDRE',
15: 'MCARI2',
16: 'NDVI',
17: 'NDRE',
18: 'MCARI2',
19: 'NDVI',
20: 'NDRE',
21: 'MCARI2',
22: 'NDVI',
23: 'NDRE'},
'MCARI2_sd': {0: 1.5,
1: np.nan,
2: np.nan,
3: 10.0,
4: np.nan,
5: np.nan,
6: 1.5,
7: np.nan,
8: np.nan,
9: 10.0,
10: np.nan,
11: np.nan,
12: 101.0,
13: np.nan,
14: np.nan,
15: 23.5,
16: np.nan,
17: np.nan,
18: 111.0,
19: np.nan,
20: np.nan,
21: 72.0,
22: np.nan,
23: np.nan},
'MACRI2_50': {0: 12.4,
1: np.nan,
2: np.nan,
3: 11.0,
4: np.nan,
5: np.nan,
6: 12.4,
7: np.nan,
8: np.nan,
9: 11.0,
10: np.nan,
11: np.nan,
12: 102.0,
13: np.nan,
14: np.nan,
15: 2.1,
16: np.nan,
17: np.nan,
18: 112.0,
19: np.nan,
20: np.nan,
21: 74.0,
22: np.nan,
23: np.nan},
'MACRI2_AVG': {0: 15.0,
1: np.nan,
2: np.nan,
3: 12.0,
4: np.nan,
5: np.nan,
6: 15.0,
7: np.nan,
8: np.nan,
9: 12.0,
10: np.nan,
11: np.nan,
12: 103.0,
13: np.nan,
14: np.nan,
15: 24.0,
16: np.nan,
17: np.nan,
18: 113.0,
19: np.nan,
20: np.nan,
21: 77.0,
22: np.nan,
23: np.nan},
'NDVI_sd': {0: np.nan,
1: 2.9,
2: np.nan,
3: np.nan,
4: 20.0,
5: np.nan,
6: np.nan,
7: 2.9,
8: np.nan,
9: np.nan,
10: 20.0,
11: np.nan,
12: np.nan,
13: 201.0,
14: np.nan,
15: np.nan,
16: 11.0,
17: np.nan,
18: np.nan,
19: 200.0,
20: np.nan,
21: np.nan,
22: 32.0,
23: np.nan},
'NDVI_50': {0: np.nan,
1: 21.0,
2: np.nan,
3: np.nan,
4: 21.0,
5: np.nan,
6: np.nan,
7: 21.0,
8: np.nan,
9: np.nan,
10: 21.0,
11: np.nan,
12: np.nan,
13: 201.0,
14: np.nan,
15: np.nan,
16: 12.0,
17: np.nan,
18: np.nan,
19: 300.0,
20: np.nan,
21: np.nan,
22: 39.0,
23: np.nan},
'NDVI_AVG': {0: np.nan,
1: 27.0,
2: np.nan,
3: np.nan,
4: 22.0,
5: np.nan,
6: np.nan,
7: 27.0,
8: np.nan,
9: np.nan,
10: 22.0,
11: np.nan,
12: np.nan,
13: 203.0,
14: np.nan,
15: np.nan,
16: 13.0,
17: np.nan,
18: np.nan,
19: 400.0,
20: np.nan,
21: np.nan,
22: 40.0,
23: np.nan},
'NDRE_sd': {0: np.nan,
1: np.nan,
2: 3.1,
3: np.nan,
4: np.nan,
5: 31.0,
6: np.nan,
7: np.nan,
8: 3.1,
9: np.nan,
10: np.nan,
11: 31.0,
12: np.nan,
13: np.nan,
14: 301.0,
15: np.nan,
16: np.nan,
17: 15.0,
18: np.nan,
19: np.nan,
20: 57.0,
21: np.nan,
22: np.nan,
23: 21.0},
'NDRE_50': {0: np.nan,
1: np.nan,
2: 33.0,
3: np.nan,
4: np.nan,
5: 32.0,
6: np.nan,
7: np.nan,
8: 33.0,
9: np.nan,
10: np.nan,
11: 32.0,
12: np.nan,
13: np.nan,
14: 302.0,
15: np.nan,
16: np.nan,
17: 16.0,
18: np.nan,
19: np.nan,
20: 58.0,
21: np.nan,
22: np.nan,
23: 22.0},
'NDRE_AVG': {0: np.nan,
1: np.nan,
2: 330.0,
3: np.nan,
4: np.nan,
5: 33.0,
6: np.nan,
7: np.nan,
8: 330.0,
9: np.nan,
10: np.nan,
11: 33.0,
12: np.nan,
13: np.nan,
14: 303.0,
15: np.nan,
16: np.nan,
17: 17.0,
18: np.nan,
19: np.nan,
20: 59.0,
21: np.nan,
22: np.nan,
23: 32.0}}
df_test = pd.DataFrame(data_dict)
def generate_data_per_EM(df):
    data_survey = []
    for (survey, mission, trial, em), data in df.groupby(['Survey', 'Mission', 'Trial', 'Estimation Method']):
        df_em = data.set_index('ID').dropna(axis=1)
        df_em.to_csv(f'tmp_data_{survey}_{mission}_{trial}_{em}.csv')  # This generates 74 files, but not sure how to join/merge them
        data_survey.append(df_em)
    # Merge the df_em column-wise
    df_final = functools.reduce(lambda left, right: pd.merge(left, right, on=['ID', 'Survey', 'Mission', 'Trial']), data_survey)
    df_final.to_csv(f'final_{survey}_{mission}_{em}.csv')  # Output is not what I expected

generate_data_per_EM(df_test)
You need a groupby:
(df_test
 .groupby(['ID', 'Plot', 'Survey', 'Trial', 'Mission'], as_index=False, sort=False)
 .first(numeric_only=True)
)
ID Plot Survey Trial Mission MCARI2_sd MACRI2_50 MACRI2_AVG NDVI_sd NDVI_50 NDVI_AVG NDRE_sd NDRE_50 NDRE_AVG
0 id1 p1 Sv1 t1 mission1 1.5 12.4 15.0 2.9 21.0 27.0 3.1 33.0 330.0
1 id1 p1 Sv1 t2 mission1 10.0 11.0 12.0 20.0 21.0 22.0 31.0 32.0 33.0
2 id1 p1 Sv2 t1 mission1 1.5 12.4 15.0 2.9 21.0 27.0 3.1 33.0 330.0
3 id1 p1 Sv2 t2 mission1 10.0 11.0 12.0 20.0 21.0 22.0 NaN NaN NaN
4 id1 p1 Sv2 t2 mission2 72.0 74.0 77.0 32.0 39.0 40.0 31.0 32.0 33.0
5 id1 p1 Sv1 t1 mission2 101.0 102.0 103.0 201.0 201.0 203.0 301.0 302.0 303.0
6 id1 p1 Sv1 t2 mission2 23.5 2.1 24.0 11.0 12.0 13.0 15.0 16.0 17.0
7 id1 p1 Sv2 t1 mission2 111.0 112.0 113.0 200.0 300.0 400.0 57.0 58.0 59.0
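For the curious, here is why .first() collapses the rows: within each group, GroupBy.first returns the first non-NaN value per column, so the three method-specific rows merge into one. A minimal sketch with made-up column names:
import numpy as np
import pandas as pd

demo = pd.DataFrame({'g': ['a', 'a', 'a'],
                     'x': [1.5, np.nan, np.nan],
                     'y': [np.nan, 2.9, np.nan]})
print(demo.groupby('g').first())
#      x    y
# g
# a  1.5  2.9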

hvplot.errorbars - linking the error bars to line/scatterplots

I have the following dataframe, containing the mean and standard deviations of data, as well as other descriptors.
{'Person': {0: 'Mark',
1: 'Mark',
2: 'Mark',
3: 'Mark',
4: 'Mark',
5: 'Mark',
6: 'Mark',
7: 'Mark',
8: 'Mark',
9: 'Mark',
10: 'Mark',
11: 'Mark',
12: 'Mark',
13: 'Mark',
14: 'Mark',
15: 'Mark',
16: 'Mark',
17: 'Mark',
18: 'Mark',
19: 'Mark',
20: 'Mark',
21: 'Mark',
22: 'John',
23: 'John',
24: 'John',
25: 'John',
26: 'John',
27: 'John',
28: 'John',
29: 'John',
30: 'John',
31: 'John',
32: 'John',
33: 'John',
34: 'John',
35: 'John',
36: 'John',
37: 'John',
38: 'John',
39: 'John',
40: 'John',
41: 'John',
42: 'John',
43: 'John'},
'Alcohol': {0: 'No',
1: 'No',
2: 'No',
3: 'No',
4: 'No',
5: 'No',
6: 'No',
7: 'No',
8: 'No',
9: 'No',
10: 'No',
11: 'Yes',
12: 'Yes',
13: 'Yes',
14: 'Yes',
15: 'Yes',
16: 'Yes',
17: 'Yes',
18: 'Yes',
19: 'Yes',
20: 'Yes',
21: 'Yes',
22: 'No',
23: 'No',
24: 'No',
25: 'No',
26: 'No',
27: 'No',
28: 'No',
29: 'No',
30: 'No',
31: 'No',
32: 'No',
33: 'Yes',
34: 'Yes',
35: 'Yes',
36: 'Yes',
37: 'Yes',
38: 'Yes',
39: 'Yes',
40: 'Yes',
41: 'Yes',
42: 'Yes',
43: 'Yes'},
'Product': {0: 'Orange',
1: 'Orange',
2: 'Orange',
3: 'Orange',
4: 'Orange',
5: 'Apple',
6: 'Apple',
7: 'Apple',
8: 'Apple',
9: 'Apple',
10: 'Apple',
11: 'Orange',
12: 'Orange',
13: 'Orange',
14: 'Orange',
15: 'Orange',
16: 'Apple',
17: 'Apple',
18: 'Apple',
19: 'Apple',
20: 'Apple',
21: 'Apple',
22: 'Orange',
23: 'Orange',
24: 'Orange',
25: 'Orange',
26: 'Orange',
27: 'Apple',
28: 'Apple',
29: 'Apple',
30: 'Apple',
31: 'Apple',
32: 'Apple',
33: 'Orange',
34: 'Orange',
35: 'Orange',
36: 'Orange',
37: 'Orange',
38: 'Apple',
39: 'Apple',
40: 'Apple',
41: 'Apple',
42: 'Apple',
43: 'Apple'},
'Concentration': {0: 0,
1: 10,
2: 20,
3: 30,
4: 40,
5: 0,
6: 10,
7: 20,
8: 30,
9: 40,
10: 50,
11: 0,
12: 10,
13: 20,
14: 30,
15: 40,
16: 0,
17: 10,
18: 20,
19: 30,
20: 40,
21: 50,
22: 0,
23: 10,
24: 20,
25: 30,
26: 40,
27: 0,
28: 10,
29: 20,
30: 30,
31: 40,
32: 50,
33: 0,
34: 10,
35: 20,
36: 30,
37: 40,
38: 0,
39: 10,
40: 20,
41: 30,
42: 40,
43: 50},
'Response': {0: 4,
1: 10,
2: 25,
3: 31,
4: 48,
5: 10,
6: 22,
7: 35,
8: 46,
9: 56,
10: 61,
11: 24,
12: 30,
13: 45,
14: 51,
15: 68,
16: 30,
17: 42,
18: 55,
19: 66,
20: 76,
21: 81,
22: 17,
23: 23,
24: 38,
25: 44,
26: 61,
27: 23,
28: 35,
29: 48,
30: 59,
31: 69,
32: 74,
33: 37,
34: 43,
35: 58,
36: 64,
37: 81,
38: 43,
39: 55,
40: 68,
41: 79,
42: 89,
43: 94},
'Response mean': {0: 4.333333333,
1: 15.0,
2: 24.33333333,
3: 35.33333333,
4: 45.33333333,
5: 12.33333333,
6: 24.66666667,
7: 34.33333333,
8: 45.0,
9: 57.66666667,
10: 55.66666667,
11: 24.33333333,
12: 35.0,
13: 44.33333333,
14: 55.33333333,
15: 65.33333333,
16: 32.33333333,
17: 44.66666667,
18: 54.33333333,
19: 65.0,
20: 77.66666667,
21: 75.66666667,
22: 17.33333333,
23: 28.0,
24: 37.33333333,
25: 48.33333333,
26: 58.33333333,
27: 25.33333333,
28: 37.66666667,
29: 47.33333333,
30: 58.0,
31: 70.66666667,
32: 68.66666667,
33: 37.33333333,
34: 48.0,
35: 57.33333333,
36: 68.33333333,
37: 78.33333333,
38: 45.33333333,
39: 57.66666667,
40: 67.33333333,
41: 78.0,
42: 90.66666667,
43: 88.66666667},
'Response SD': {0: 1.527525232,
1: 4.582575695,
2: 2.081665999,
3: 4.041451884,
4: 2.516611478,
5: 2.516611478,
6: 3.055050463,
7: 2.081665999,
8: 1.0,
9: 1.527525232,
10: 14.74222959,
11: 1.527525232,
12: 4.582575695,
13: 2.081665999,
14: 4.041451884,
15: 2.516611478,
16: 2.516611478,
17: 3.055050463,
18: 2.081665999,
19: 1.0,
20: 1.527525232,
21: 14.74222959,
22: 1.527525232,
23: 4.582575695,
24: 2.081665999,
25: 4.041451884,
26: 2.516611478,
27: 2.516611478,
28: 3.055050463,
29: 2.081665999,
30: 1.0,
31: 1.527525232,
32: 14.74222959,
33: 1.527525232,
34: 4.582575695,
35: 2.081665999,
36: 4.041451884,
37: 2.516611478,
38: 2.516611478,
39: 3.055050463,
40: 2.081665999,
41: 1.0,
42: 1.527525232,
43: 14.74222959}}
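The question does not show how df2 is built from the pasted dict; presumably something like the following (the name data is an assumption standing for the dict above):
import pandas as pd

# 'data' stands for the dict pasted above
df2 = pd.DataFrame(data)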
I've started using hvplot because of the interactivity it offers to explore data, and from what I can tell, it's based on Bokeh. In the example code below, I can plot the mean values (using scatter to give me the glyphs) and the line through the glyphs (using line), overlaying them with the scatter * line code. However, as I need to plot the standard deviations, I'm using the errorbars plot too, which works great (note that I use scatter * line * error). However, when I click on the legend to remove certain data, the scatter and line are removed (note that I have muted_alpha=0 for the line and scatter, but there is no such option for the error bars), yet the error bars stay on the plot. It's as though the scatter and line are 'linked' together, but the errorbars aren't.
line = df2.hvplot.line(x='Concentration', y='Response mean', by='Product', groupby= ['Person', 'Alcohol'], muted_alpha=0)
scatter = df2.hvplot.scatter(x='Concentration', y='Response mean', by='Product', groupby=['Person', 'Alcohol'], marker='o', size=40, muted_alpha=0)
error = df2.hvplot.errorbars(x='Concentration', y='Response mean', yerr1='Response SD', by='Product', groupby=['Person', 'Alcohol'])
all_plots = scatter * line * error
all_plots
Can anyone help me to link the errorbars to its corresponding data, so when I click on the legend, the error bars, scatter and line are removed from the plot?
Thank you in advance!

networkx referring to multi-dimensions and displaying data

New to network graphs, so I was hoping for a little guidance...
I'm trying to create a network graph between users collaborating with each other, my problem is I cannot figure out how to add multiple dimensions to the network.
At a high level I want to show:
User-to-User interactions
Add some sort of size indication of users who are collaborating more (via the edges, the more interactions the thicker the line between the two users).
Add color to the edges/lines indicating the project they worked on together
Add color to node based on user license type
So something like:
So far I have the following:
import pandas as pd
import networkx as nx
from pyvis.network import Network
df_dict = {'PROJECT': {0: 'Finance Project', 1: 'Finance Project', 2: 'Finance Project', 3: 'Finance Project', 4: 'Finance Project', 5: 'Finance Project', 6: 'Finance Project', 7: 'Finance Project', 8: 'Finance Project', 9: 'Finance Project', 10: 'Finance Project', 11: 'Finance Project', 12: 'HR Project', 13: 'Finance Project', 14: 'HR Project', 15: 'Finance Project'},
'PLAN': {0: 'COMPANY', 1: 'COMPANY', 2: 'COMPANY', 3: 'COMPANY', 4: 'COMPANY', 5: 'COMPANY', 6: 'COMPANY', 7: 'COMPANY', 8: 'COMPANY', 9: 'COMPANY', 10: 'COMPANY', 11: 'COMPANY', 12: 'COMPANY', 13: 'COMPANY', 14: 'COMPANY', 15: 'COMPANY'},
'USER_ONE': {0: 'Mike Jones', 1: 'Eminem', 2: 'Mike Jones', 3: 'Mike Jones', 4: 'Michael Jordan', 5: 'Eminem', 6: 'Michael Jordan', 7: 'Michael Jordan', 8: 'Mike Jones', 9: 'Kobe Bryant', 10: 'Eminem', 11: 'Elon Musk', 12: 'Bill Gates', 13: 'Elon Musk', 14: 'Mark Zuckerberg', 15: 'Elon Musk'},
'USER_ONE_LICENSE': {0: 'FULL', 1: 'FULL', 2: 'FULL', 3: 'FULL', 4: 'FULL', 5: 'FULL', 6: 'FULL', 7: 'FULL', 8: 'FULL', 9: 'OCCASIONAL', 10: 'FULL', 11: 'FULL', 12: 'FULL', 13: 'FULL', 14: 'FULL', 15: 'FULL'},
'USER_ONE_LICENSE_COLOR': {0: 'lightgreen', 1: 'lightgreen', 2: 'lightgreen', 3: 'lightgreen', 4: 'lightgreen', 5: 'lightgreen', 6: 'lightgreen', 7: 'lightgreen', 8: 'lightgreen', 9: 'gray', 10: 'lightgreen', 11: 'lightgreen', 12: 'lightgreen', 13: 'lightgreen', 14: 'lightgreen', 15: 'lightgreen'},
'USER_ONE_DAYS_COLLAB': {0: 88, 1: 55, 2: 67, 3: 1, 4: 70, 5: 54, 6: 2, 7: 114, 8: 4, 9: 1, 10: 10, 11: 19, 12: 5, 13: 11, 14: 100, 15: 13},
'USER_TWO': {0: 'Michael Jordan', 1: 'Mike Jones', 2: 'Eminem', 3: 'Kobe Bryant', 4: 'Eminem', 5: 'Michael Jordan', 6: 'Elon Musk', 7: 'Mike Jones', 8: 'Elon Musk', 9: 'Mike Jones', 10: 'Elon Musk', 11: 'Eminem', 12: 'Mark Zuckerberg', 13: 'Michael Jordan', 14: 'Bill Gates', 15: 'Mike Jones'},
'USER_TWO_LICENSE': {0: 'FULL', 1: 'FULL', 2: 'FULL', 3: 'OCCASIONAL', 4: 'FULL', 5: 'FULL', 6: 'FULL', 7: 'FULL', 8: 'FULL', 9: 'FULL', 10: 'FULL', 11: 'FULL', 12: 'FULL', 13: 'FULL', 14: 'FULL', 15: 'FULL'},
'USER_TWO_LICENSE_COLOR': {0: 'lightgreen', 1: 'lightgreen', 2: 'lightgreen', 3: 'gray', 4: 'lightgreen', 5: 'lightgreen', 6: 'lightgreen', 7: 'lightgreen', 8: 'lightgreen', 9: 'lightgreen', 10: 'lightgreen', 11: 'lightgreen', 12: 'lightgreen', 13: 'lightgreen', 14: 'lightgreen', 15: 'lightgreen'},
'USER_TWO_DAYS_COLLAB': {0: 114, 1: 67, 2: 55, 3: 1, 4: 54, 5: 70, 6: 11, 7: 88, 8: 13, 9: 1, 10: 19, 11: 10, 12: 100, 13: 2, 14: 5, 15: 4}
, 'TOTAL_COLLABS': {0: 202, 1: 122, 2: 122, 3: 2, 4: 124, 5: 124, 6: 13, 7: 202, 8: 17, 9: 2, 10: 29, 11: 29, 12: 105, 13: 13, 14: 105, 15: 17}}
df = pd.DataFrame(df_dict)
# Where do I add all the other attributes?
# i.e. license type, project, # of interactions (I'm assuming this can be something like weights?)
# In my case I believe my source + target needs to be Project + User?
G = nx.from_pandas_edgelist(df,
                            source='USER_ONE',
                            target='USER_TWO')  # I tried ['PROJECT', 'USER_TWO']
net = Network(notebook=True)
net.from_nx(G)
net.show_buttons(filter_=True)
net.show('example4.html')
All the examples I've seen only have one source and one target - mine needs user + project for both source and target. Is there a way to do this without creating one field that combines both?
Haven't found a clear way to color nodes; the example provided keys the color off the node value (in my case the node is just text, and I have another dimension I want to use to dictate the color).
Haven't found a clear way to build a case statement on edge width either (see the sketch after this question). Concretely:
if count of interactions <= 1 then "small width"
if count of interactions > 1 and <= 5 then "medium width"
etc...
Any direction or resources would be greatly appreciated -- everything I come across seems to be different from my setup, leaving me unsure how to proceed.
my table looks something like this for reference:
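One possible direction, as a minimal sketch rather than a definitive answer: nx.from_pandas_edgelist accepts an edge_attr list, and pyvis's from_nx is assumed here (recent pyvis versions) to copy node/edge attributes such as color, width, and title through to the rendered network; the bucketing thresholds mirror the case statement above.
import networkx as nx
import pandas as pd
from pyvis.network import Network

# df as built from df_dict above
G = nx.from_pandas_edgelist(df,
                            source='USER_ONE',
                            target='USER_TWO',
                            edge_attr=['PROJECT', 'TOTAL_COLLABS'])

# color each node by license; users appearing only in USER_TWO are covered
# by the second mapping
colors = df.set_index('USER_ONE')['USER_ONE_LICENSE_COLOR'].to_dict()
colors.update(df.set_index('USER_TWO')['USER_TWO_LICENSE_COLOR'].to_dict())
nx.set_node_attributes(G, colors, 'color')

def bucket(n):
    # case statement on interaction count -> edge width
    if n <= 1:
        return 1
    if n <= 5:
        return 3
    return 6

for u, v, data in G.edges(data=True):
    data['width'] = bucket(data['TOTAL_COLLABS'])
    data['title'] = data['PROJECT']  # hover text showing the shared project
    # an edge 'color' keyed off PROJECT could be assigned the same way

net = Network(notebook=True)
net.from_nx(G)
net.show('example4.html')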

How to add lines with annotations to candlestick charts when some values are missing?

I'm trying to use Plotly to overlay a marker/line chart on top of my OHLC candle chart.
Code
import plotly.graph_objects as go
import pandas as pd
import numpy as np  # needed for the np.nan values in the frame below
from datetime import datetime
df = pd.DataFrame(
{'index': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23,
24: 24},
'Date': {0: '2018-09-03',
1: '2018-09-04',
2: '2018-09-05',
3: '2018-09-06',
4: '2018-09-07',
5: '2018-09-10',
6: '2018-09-11',
7: '2018-09-12',
8: '2018-09-13',
9: '2018-09-14',
10: '2018-09-17',
11: '2018-09-18',
12: '2018-09-19',
13: '2018-09-20',
14: '2018-09-21',
15: '2018-09-24',
16: '2018-09-25',
17: '2018-09-26',
18: '2018-09-27',
19: '2018-09-28',
20: '2018-10-01',
21: '2018-10-02',
22: '2018-10-03',
23: '2018-10-04',
24: '2018-10-05'},
'Open': {0: 1.2922067642211914,
1: 1.2867859601974487,
2: 1.2859420776367188,
3: 1.2914056777954102,
4: 1.2928247451782229,
5: 1.292808175086975,
6: 1.3027958869934082,
7: 1.3017443418502808,
8: 1.30451238155365,
9: 1.3110626935958862,
10: 1.3071041107177734,
11: 1.3146650791168213,
12: 1.3166556358337402,
13: 1.3140604496002195,
14: 1.3271400928497314,
15: 1.3080958127975464,
16: 1.3117163181304932,
17: 1.3180439472198486,
18: 1.3169677257537842,
19: 1.3077707290649414,
20: 1.3039510250091553,
21: 1.3043931722640991,
22: 1.2979763746261597,
23: 1.2941633462905884,
24: 1.3022021055221558},
'High': {0: 1.2934937477111816,
1: 1.2870012521743774,
2: 1.2979259490966797,
3: 1.2959914207458496,
4: 1.3024225234985352,
5: 1.3052103519439695,
6: 1.30804443359375,
7: 1.3044441938400269,
8: 1.3120088577270508,
9: 1.3143367767333984,
10: 1.3156682252883911,
11: 1.3171066045761108,
12: 1.3211784362792969,
13: 1.3296104669570925,
14: 1.3278449773788452,
15: 1.3166556358337402,
16: 1.3175750970840454,
17: 1.3196094036102295,
18: 1.3180439472198486,
19: 1.3090718984603882,
20: 1.3097577095031738,
21: 1.3049719333648682,
22: 1.3020155429840088,
23: 1.3036959171295166,
24: 1.310753345489502},
'Low': {0: 1.2856279611587524,
1: 1.2813942432403564,
2: 1.2793285846710205,
3: 1.289723515510559,
4: 1.2918561697006226,
5: 1.289823293685913,
6: 1.2976733446121216,
7: 1.298414707183838,
8: 1.3027619123458862,
9: 1.3073604106903076,
10: 1.3070186376571655,
11: 1.3120776414871216,
12: 1.3120431900024414,
13: 1.3140085935592651,
14: 1.305841088294983,
15: 1.3064552545547483,
16: 1.3097233772277832,
17: 1.3141123056411743,
18: 1.309706211090088,
19: 1.3002548217773438,
20: 1.3014055490493774,
21: 1.2944146394729614,
22: 1.2964619398117063,
23: 1.2924572229385376,
24: 1.3005592823028564},
'Close': {0: 1.292306900024414,
1: 1.2869019508361816,
2: 1.2858428955078125,
3: 1.2914891242980957,
4: 1.2925406694412231,
5: 1.2930254936218262,
6: 1.302643060684204,
7: 1.3015578985214231,
8: 1.304546356201172,
9: 1.311131477355957,
10: 1.307326316833496,
11: 1.3146305084228516,
12: 1.3168463706970217,
13: 1.3141123056411743,
14: 1.327087163925171,
15: 1.30804443359375,
16: 1.3117333650588991,
17: 1.3179919719696045,
18: 1.3172800540924072,
19: 1.3078734874725342,
20: 1.3039000034332275,
21: 1.3043591976165771,
22: 1.2981956005096436,
23: 1.294062852859497,
24: 1.3024225234985352},
'Pivot Price': {0: 1.2934937477111816,
1: np.nan,
2: 1.2793285846710205,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: 1.3296104669570925,
14: np.nan,
15: np.nan,
16: np.nan,
17: np.nan,
18: np.nan,
19: np.nan,
20: np.nan,
21: np.nan,
22: np.nan,
23: 1.2924572229385376,
24: np.nan}})
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
                                     open=df['Open'],
                                     high=df['High'],
                                     low=df['Low'],
                                     close=df['Close'])])
fig.add_trace(
    go.Scatter(mode='lines+markers',
               x=df['Date'],
               y=df['Pivot Price']))
fig.update_layout(
    autosize=False,
    width=1000,
    height=800)
fig.show()
This is the current image
This is the desired output/image
I want a black line between the markers (pivots). I would also ideally like a value next to each line showing the distance between each pivot, but I'm not sure how to do this.
For example, the distance between the first two pivots, round(abs(1.293494 - 1.279329), 3), returns 0.014, so I would ideally like this next to the line.
The second is round(abs(1.279329 - 1.329610), 3), so the value would be 0.05. I have hand-edited the image and added the lines for the first two values to give a visual representation of what I'm trying to achieve.
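For reference, those pairwise distances fall out of the pivot column in one line (using the df defined above):
df['Pivot Price'].dropna().diff().abs().round(3)
# 0       NaN
# 2     0.014
# 13    0.050
# 23    0.037
# Name: Pivot Price, dtype: float64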
The problem seems to be the missing values. So just use pandas.Series.interpolate in combination with fig.add_annotation to get:
I've included annotations for differences as well. There are surely more elegant ways to do it than with for loops, but it does the job. Let me know if anything is unclear!
import pandas as pd
import numpy as np
import plotly.graph_objects as go
# df is constructed exactly as in the question above
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
                                     open=df['Open'],
                                     high=df['High'],
                                     low=df['Low'],
                                     close=df['Close'])])
# difference between consecutive pivot values (the first entry is NaN)
df_diff = df['Pivot Price'].dropna().diff()
fig.add_trace(
    go.Scatter(mode='lines+markers',
               x=df['Date'],
               y=df['Pivot Price']))
fig.update_layout(
    autosize=False,
    width=1000,
    height=800)
# interpolate over the NaNs so the black line connects the pivots
fig.add_trace(go.Scatter(x=df['Date'], y=df['Pivot Price'].interpolate(),
                         mode='lines',
                         line=dict(color='black')))
def annot(value):
    # the first pivot has no predecessor, so its diff is NaN -> empty label
    if np.isnan(value):
        return ''
    return value

# walk the pivots in order, annotating each with the distance to the previous one
j = 0
for i, p in enumerate(df['Pivot Price']):
    if not np.isnan(p):
        fig.add_annotation(dict(font=dict(color='rgba(0,0,200,0.8)', size=12),
                                x=df['Date'].iloc[i],
                                y=p,
                                showarrow=False,
                                text=annot(round(abs(df_diff.iloc[j]), 3)),
                                textangle=0,
                                xanchor='right',
                                xref='x',
                                yref='y'))
        j = j + 1
fig.update_xaxes(type='category')
fig.show()
The problem seems to be the missing values, which Plotly has difficulty with. With this trick you can plot only the points that have a value:
import plotly.graph_objects as go

# df as constructed from the question's dict above
has_value = ~df["Pivot Price"].isna()

fig = go.Figure(data=[go.Candlestick(x=df['Date'],
                                     open=df['Open'],
                                     high=df['High'],
                                     low=df['Low'],
                                     close=df['Close'])])
# draw the connecting line only through rows that actually have a pivot value
fig.add_trace(
    go.Scatter(mode='lines',
               x=df[has_value]['Date'],
               y=df[has_value]['Pivot Price'],
               line={'color': 'black', 'width': 1}))
fig.add_trace(
    go.Scatter(mode='markers',
               x=df['Date'],
               y=df['Pivot Price']))
fig.update_layout(
    autosize=False,
    width=1000,
    height=800)
fig.show()
This did it for me.

How to remove duplicates based on lower frequency [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 2 years ago.
I have a table that looks like this
I want to keep the id for each brand with the highest freq. For example, in the case of audi both ids have the same frequency, so keep only one. In the case of mercedes-benz, keep the latter one since it has frequency 7.
This is my dataframe:
{'Brand':
{0: 'audi',
1: 'audi',
2: 'bmw',
3: 'dacia',
4: 'fiat',
5: 'ford',
6: 'ford',
7: 'honda',
8: 'honda',
9: 'hyundai',
10: 'kia',
11: 'mercedes-benz',
12: 'mercedes-benz',
13: 'nissan',
14: 'nissan',
15: 'opel',
16: 'renault',
17: 'renault',
18: 'renault',
19: 'renault',
20: 'toyota',
21: 'toyota',
22: 'volvo',
23: 'vw',
24: 'vw',
25: 'vw',
26: 'vw'},
'id':
{0: 'audi_a4_dynamic_2016_otomatik',
1: 'audi_a6_standart_2015_otomatik',
2: 'bmw_5 series_executive_2016_otomatik',
3: 'dacia_duster_laureate_2017_manuel',
4: 'fiat_egea_easy_2017_manuel',
5: 'ford_focus_trend x_2015_manuel',
6: 'ford_focus_trend x_2015_otomatik',
7: 'honda_civic_eco elegance_2017_otomatik',
8: 'honda_cr-v_executive_2018_otomatik',
9: 'hyundai_tucson_elite plus_2017_otomatik',
10: 'kia_sportage_concept plus_2015_otomatik',
11: 'mercedes-benz_c-class_amg_2016_otomatik',
12: 'mercedes-benz_e-class_edition e_2015_otomatik',
13: 'nissan_qashqai_black edition_2014_manuel',
14: 'nissan_qashqai_sky pack_2015_otomatik',
15: 'opel_astra_edition_2016_manuel',
16: 'renault_clio_joy_2016_manuel',
17: 'renault_kadjar_icon_2015_otomatik',
18: 'renault_kadjar_icon_2016_otomatik',
19: 'renault_mégane_touch_2017_otomatik',
20: 'toyota_corolla_touch_2015_otomatik',
21: 'toyota_corolla_touch_2016_otomatik',
22: 'volvo_s60_advance_2018_otomatik',
23: 'vw_jetta_comfortline_2013_otomatik',
24: 'vw_passat_highline_2017_otomatik',
25: 'vw_tiguan_sport&style_2012_manuel',
26: 'vw_tiguan_sport&style_2013_manuel'},
'freq': {0: 4,
1: 4,
2: 7,
3: 4,
4: 4,
5: 4,
6: 4,
7: 4,
8: 4,
9: 4,
10: 4,
11: 4,
12: 7,
13: 4,
14: 4,
15: 4,
16: 4,
17: 4,
18: 4,
19: 4,
20: 4,
21: 4,
22: 4,
23: 4,
24: 7,
25: 4,
26: 4}}
Edit: tried one of the answers and got an extra level of header
You need to group by Brand with pandas.groupby and then aggregate with respect to the maximal frequency.
Something like this should work:
df.groupby('Brand')[['id', 'freq']].agg({'freq': 'max'})
To get your result, run:
result = df.groupby('Brand', as_index=False).apply(
    lambda grp: grp[grp.freq == grp.freq.max()].iloc[0])
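Following the duplicate target linked above, a compact idxmax-based sketch gives the same result (assuming the frame keeps its default RangeIndex): idxmax returns the label of the first maximal row per group, so ties such as audi keep exactly one id. The ids below are shortened, illustrative values.
import pandas as pd

df = pd.DataFrame({'Brand': ['audi', 'audi', 'mercedes-benz', 'mercedes-benz'],
                   'id': ['a4', 'a6', 'c-class', 'e-class'],  # shortened ids
                   'freq': [4, 4, 4, 7]})
result = df.loc[df.groupby('Brand')['freq'].idxmax()]
print(result)
#            Brand       id  freq
# 0           audi       a4     4
# 3  mercedes-benz  e-class     7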
