graphs overlapping and redundant code to clear it out - python

I've been using RMarkdown to create graphs. Then I take the graphs and copy and paste them into Powerpoint presentations. That's been my workflow.
Here is the dataframe that I am using.
{'Unnamed: 0': {0: 'Mazda RX4', 1: 'Mazda RX4 Wag', 2: 'Datsun 710', 3: 'Hornet 4 Drive', 4: 'Hornet Sportabout', 5: 'Valiant', 6: 'Duster 360', 7: 'Merc 240D', 8: 'Merc 230', 9: 'Merc 280', 10: 'Merc 280C', 11: 'Merc 450SE', 12: 'Merc 450SL', 13: 'Merc 450SLC', 14: 'Cadillac Fleetwood', 15: 'Lincoln Continental', 16: 'Chrysler Imperial', 17: 'Fiat 128', 18: 'Honda Civic', 19: 'Toyota Corolla', 20: 'Toyota Corona', 21: 'Dodge Challenger', 22: 'AMC Javelin', 23: 'Camaro Z28', 24: 'Pontiac Firebird', 25: 'Fiat X1-9', 26: 'Porsche 914-2', 27: 'Lotus Europa', 28: 'Ford Pantera L', 29: 'Ferrari Dino', 30: 'Maserati Bora', 31: 'Volvo 142E'}, 'mpg': {0: 21.0, 1: 21.0, 2: 22.8, 3: 21.4, 4: 18.7, 5: 18.1, 6: 14.3, 7: 24.4, 8: 22.8, 9: 19.2, 10: 17.8, 11: 16.4, 12: 17.3, 13: 15.2, 14: 10.4, 15: 10.4, 16: 14.7, 17: 32.4, 18: 30.4, 19: 33.9, 20: 21.5, 21: 15.5, 22: 15.2, 23: 13.3, 24: 19.2, 25: 27.3, 26: 26.0, 27: 30.4, 28: 15.8, 29: 19.7, 30: 15.0, 31: 21.4}, 'cyl': {0: 6, 1: 6, 2: 4, 3: 6, 4: 8, 5: 6, 6: 8, 7: 4, 8: 4, 9: 6, 10: 6, 11: 8, 12: 8, 13: 8, 14: 8, 15: 8, 16: 8, 17: 4, 18: 4, 19: 4, 20: 4, 21: 8, 22: 8, 23: 8, 24: 8, 25: 4, 26: 4, 27: 4, 28: 8, 29: 6, 30: 8, 31: 4}, 'disp': {0: 160.0, 1: 160.0, 2: 108.0, 3: 258.0, 4: 360.0, 5: 225.0, 6: 360.0, 7: 146.7, 8: 140.8, 9: 167.6, 10: 167.6, 11: 275.8, 12: 275.8, 13: 275.8, 14: 472.0, 15: 460.0, 16: 440.0, 17: 78.7, 18: 75.7, 19: 71.1, 20: 120.1, 21: 318.0, 22: 304.0, 23: 350.0, 24: 400.0, 25: 79.0, 26: 120.3, 27: 95.1, 28: 351.0, 29: 145.0, 30: 301.0, 31: 121.0}, 'hp': {0: 110, 1: 110, 2: 93, 3: 110, 4: 175, 5: 105, 6: 245, 7: 62, 8: 95, 9: 123, 10: 123, 11: 180, 12: 180, 13: 180, 14: 205, 15: 215, 16: 230, 17: 66, 18: 52, 19: 65, 20: 97, 21: 150, 22: 150, 23: 245, 24: 175, 25: 66, 26: 91, 27: 113, 28: 264, 29: 175, 30: 335, 31: 109}, 'drat': {0: 3.9, 1: 3.9, 2: 3.85, 3: 3.08, 4: 3.15, 5: 2.76, 6: 3.21, 7: 3.69, 8: 3.92, 9: 3.92, 10: 3.92, 11: 3.07, 12: 3.07, 13: 3.07, 14: 2.93, 15: 3.0, 16: 3.23, 17: 4.08, 18: 4.93, 19: 
4.22, 20: 3.7, 21: 2.76, 22: 3.15, 23: 3.73, 24: 3.08, 25: 4.08, 26: 4.43, 27: 3.77, 28: 4.22, 29: 3.62, 30: 3.54, 31: 4.11}, 'wt': {0: 2.62, 1: 2.875, 2: 2.32, 3: 3.215, 4: 3.44, 5: 3.46, 6: 3.57, 7: 3.19, 8: 3.15, 9: 3.44, 10: 3.44, 11: 4.07, 12: 3.73, 13: 3.78, 14: 5.25, 15: 5.424, 16: 5.345, 17: 2.2, 18: 1.615, 19: 1.835, 20: 2.465, 21: 3.52, 22: 3.435, 23: 3.84, 24: 3.845, 25: 1.935, 26: 2.14, 27: 1.513, 28: 3.17, 29: 2.77, 30: 3.57, 31: 2.78}, 'qsec': {0: 16.46, 1: 17.02, 2: 18.61, 3: 19.44, 4: 17.02, 5: 20.22, 6: 15.84, 7: 20.0, 8: 22.9, 9: 18.3, 10: 18.9, 11: 17.4, 12: 17.6, 13: 18.0, 14: 17.98, 15: 17.82, 16: 17.42, 17: 19.47, 18: 18.52, 19: 19.9, 20: 20.01, 21: 16.87, 22: 17.3, 23: 15.41, 24: 17.05, 25: 18.9, 26: 16.7, 27: 16.9, 28: 14.5, 29: 15.5, 30: 14.6, 31: 18.6}, 'vs': {0: 0, 1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0, 7: 1, 8: 1, 9: 1, 10: 1, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 1, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 0, 27: 1, 28: 0, 29: 0, 30: 0, 31: 1}, 'am': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 1, 18: 1, 19: 1, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 1, 26: 1, 27: 1, 28: 1, 29: 1, 30: 1, 31: 1}, 'gear': {0: 4, 1: 4, 2: 4, 3: 3, 4: 3, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 3, 15: 3, 16: 3, 17: 4, 18: 4, 19: 4, 20: 3, 21: 3, 22: 3, 23: 3, 24: 3, 25: 4, 26: 5, 27: 5, 28: 5, 29: 5, 30: 5, 31: 4}, 'carb': {0: 4, 1: 4, 2: 1, 3: 1, 4: 2, 5: 1, 6: 4, 7: 2, 8: 2, 9: 4, 10: 4, 11: 3, 12: 3, 13: 3, 14: 4, 15: 4, 16: 4, 17: 1, 18: 2, 19: 1, 20: 1, 21: 2, 22: 2, 23: 4, 24: 2, 25: 1, 26: 2, 27: 2, 28: 4, 29: 6, 30: 8, 31: 2}}
The code looks like this.
```{r, warning = FALSE, message = FALSE}
ggplot2::ggplot(data = mtcars, aes(x = wt, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(face = "bold"))
ggplot2::ggplot(data = mtcars, aes(x = mpg, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(face = "bold"))
ggplot2::ggplot(data = mtcars, aes(x = disp, y = after_stat(count))) +
geom_histogram(bins = 32, color = 'black', fill = '#ffe6b7') +
labs(title = "Mtcars", subtitle = "Histogram") +
theme(plot.title = element_text(face = "bold"))
```
And here is a screenshot of the output.
Now I'm trying to do the same using python graphs. I'm seeing that I can't do the same thing exactly because the graphs start overlapping.
```{python}
seaborn.histplot(data=mtcars, x="wt", bins = 30)
plt.title("wt histogram", loc = 'left')
plt.show()
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.title("mpg histogram", loc = 'left')
plt.show()
seaborn.histplot(data=mtcars, x="disp", bins = 30)
plt.title("disp histogram", loc = 'left')
plt.show()
```
So now what I'm doing is I'm clearing out the space after I create every single graph. The output now looks fine - I get a distinct histogram for each variable I'm calling.
```{python}
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="wt", bins = 30)
plt.title("wt histogram", loc = 'left')
plt.show()
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="mpg", bins = 30)
plt.title("mpg histogram", loc = 'left')
plt.show()
plt.figure().clear()
plt.close()
plt.cla()
plt.clf()
seaborn.histplot(data=mtcars, x="disp", bins = 30)
plt.title("disp histogram", loc = 'left')
plt.show()
```
The output is definitely better.
But isn't this method really redundant? What do people who use python more regularly do to maintain what is happening with the graphs? Do you all clear out the space every time in this way?
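Not a definitive answer, but one common pattern is to stop relying on the implicit "current figure" and create an explicit Figure/Axes pair for each plot, closing it once it has been shown. The sketch below assumes plain matplotlib with the first five rows of the data above; `plot_hist` is a hypothetical helper name, and the `Agg` backend is only selected so the sketch runs without a display. With seaborn, the same idea applies by passing `ax=ax` to `seaborn.histplot`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# First five rows of the mtcars columns used in the question
mtcars = pd.DataFrame({
    "wt": [2.62, 2.875, 2.32, 3.215, 3.44],
    "mpg": [21.0, 21.0, 22.8, 21.4, 18.7],
    "disp": [160.0, 160.0, 108.0, 258.0, 360.0],
})


def plot_hist(data, column, bins=30):
    """Draw one histogram on its own, freshly created figure (hypothetical helper)."""
    fig, ax = plt.subplots()  # new figure each call, so nothing overlaps
    ax.hist(data[column], bins=bins, edgecolor="black")
    ax.set_title(f"{column} histogram", loc="left")
    return fig


titles = []
for column in ["wt", "mpg", "disp"]:
    fig = plot_hist(mtcars, column)
    titles.append(fig.axes[0].get_title(loc="left"))
    # In a notebook or R Markdown chunk, plt.show() would go here.
    plt.close(fig)  # a single close replaces the clear/cla/clf calls
```

Because every plot owns its figure, a single `plt.close(fig)` per plot should be enough.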

Related

Why does the size of my 3D Plotly Scatterplot randomly change?

I am trying to create an animated 3D scatterplot to represent fish swimming in 3D space. I have 8 fish, and for each fish I have 4 points. I am able to make the graph and animate it, however the size of the graph changes randomly between time points. I have set the axes mins and maxes, but the distance between them seems to change. What aspect of the plot do I need to alter in order to keep it stable?
This is the plotly express command that I am using:
fig = px.scatter_3d(df,x="x", y="y", z="z",
color="Fish", animation_frame="Frame", hover_data = ["BodyPart"],
range_x=[-0.25,0.25], range_y=[-0.15,0.15], range_z=[-0.15,0.15],
color_continuous_scale = "rainbow")
These two images show the graph one frame apart from one another. The green square shows stats on one point to show that it is not changing drastically:
I am also including this video for a clearer example.
Edited:
Minimum graphing code:
import pandas as pd
import plotly.express as px
data_dict = {'Fish': {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2, 11: 2, 12: 3, 13: 3, 14: 3, 15: 3, 16: 4, 17: 4, 18: 4, 19: 4, 20: 5, 21: 5, 22: 5, 23: 5, 24: 6, 25: 6, 26: 6, 27: 6, 28: 7, 29: 7, 30: 7, 31: 7, 32: 0, 33: 0, 34: 0, 35: 0, 36: 1, 37: 1, 38: 1, 39: 1, 40: 2, 41: 2, 42: 2, 43: 2, 44: 3, 45: 3, 46: 3, 47: 3, 48: 4, 49: 4, 50: 4, 51: 4, 52: 5, 53: 5, 54: 5, 55: 5, 56: 6, 57: 6, 58: 6, 59: 6, 60: 7, 61: 7, 62: 7, 63: 7}, 'BodyPart': {0: 'head', 1: 'midline2', 2: 'tailbase', 3: 'tailtip', 4: 'head', 5: 'midline2', 6: 'tailbase', 7: 'tailtip', 8: 'head', 9: 'midline2', 10: 'tailbase', 11: 'tailtip', 12: 'head', 13: 'midline2', 14: 'tailbase', 15: 'tailtip', 16: 'head', 17: 'midline2', 18: 'tailbase', 19: 'tailtip', 20: 'head', 21: 'midline2', 22: 'tailbase', 23: 'tailtip', 24: 'head', 25: 'midline2', 26: 'tailbase', 27: 'tailtip', 28: 'head', 29: 'midline2', 30: 'tailbase', 31: 'tailtip', 32: 'head', 33: 'midline2', 34: 'tailbase', 35: 'tailtip', 36: 'head', 37: 'midline2', 38: 'tailbase', 39: 'tailtip', 40: 'head', 41: 'midline2', 42: 'tailbase', 43: 'tailtip', 44: 'head', 45: 'midline2', 46: 'tailbase', 47: 'tailtip', 48: 'head', 49: 'midline2', 50: 'tailbase', 51: 'tailtip', 52: 'head', 53: 'midline2', 54: 'tailbase', 55: 'tailtip', 56: 'head', 57: 'midline2', 58: 'tailbase', 59: 'tailtip', 60: 'head', 61: 'midline2', 62: 'tailbase', 63: 'tailtip'}, 'x': {0: 0.121283071, 1: 0.074230535, 2: 0.096664814, 3: 0.063435668, 4: -0.11843468, 5: -0.133776416, 6: -0.12698166, 7: -0.133996648, 8: 0.154499401, 9: 0.099541555, 10: 0.126525899, 11: 0.086448979, 12: -0.001723707, 13: -0.064203743, 14: -0.033163578, 15: -0.077987938, 16: 0.160456072, 17: 0.175340028, 18: 0.178537856, 19: 0.16438273, 20: -0.151890354, 21: -0.099510254, 22: -0.123827166, 23: -0.08765671, 24: 0.052741099, 25: -0.003778201, 26: 0.022010701, 27: -0.014747641, 28: -0.137528989, 29: -0.078632593, 30: -0.106688178, 31: -0.065274018, 32: 0.12128202, 33: 0.074230379, 34: 
0.096662597, 35: 0.063435699, 36: -0.118412987, 37: -0.133729238, 38: -0.12729935, 39: -0.134238167, 40: 0.154498856, 41: 0.099541572, 42: 0.126525899, 43: 0.086450612, 44: -0.001719156, 45: -0.064209291, 46: -0.033163578, 47: -0.07796947, 48: 0.157094899, 49: 0.175288008, 50: 0.178383788, 51: 0.1643551, 52: -0.153086656, 53: -0.100645272, 54: -0.125700666, 55: -0.089248865, 56: 0.052731775, 57: -0.003778201, 58: 0.022011924, 59: -0.014749184, 60: -0.138954183, 61: -0.079588201, 62: -0.107413558, 63: -0.06588028}, 'y': {0: -0.018777537, 1: -0.017936625, 2: -0.019031854, 3: -0.018688299, 4: 0.031655295, 5: 0.089278103, 6: 0.060434868, 7: 0.102354879, 8: 0.012448659, 9: 0.005374916, 10: 0.008431857, 11: 0.010384436, 12: 0.007394437, 13: 0.002657548, 14: 0.0047918, 15: 0.004216939, 16: -0.061691249, 17: -0.022574622, 18: -0.044862196, 19: -0.015288812, 20: 0.126254494, 21: 0.125420316, 22: 0.127216595, 23: 0.122366769, 24: -0.018798237, 25: -0.026209512, 26: -0.020654802, 27: -0.030922742, 28: 0.100460973, 29: 0.091726762, 30: 0.095608508, 31: 0.089022071, 32: -0.018930378, 33: -0.018313362, 34: -0.019121954, 35: -0.018839649, 36: 0.030465513, 37: 0.087966041, 38: 0.058855924, 39: 0.100617287, 40: 0.012372615, 41: 0.00530059, 42: 0.008431857, 43: 0.009864426, 44: 0.007169236, 45: 0.002524294, 46: 0.0047918, 47: 0.002813216, 48: -0.061409007, 49: -0.024774863, 50: -0.045825365, 51: -0.017002469, 52: 0.125813664, 53: 0.125533354, 54: 0.126988948, 55: 0.121414741, 56: -0.019165739, 57: -0.026209512, 58: -0.020802186, 59: -0.031842627, 60: 0.100213119, 61: 0.091677506, 62: 0.095490242, 63: 0.08724155}, 'z': {0: -0.011584533, 1: -0.005671144, 2: -0.004720913, 3: -0.007099159, 4: 0.048633092, 5: 0.044680886, 6: 0.047755313, 7: 0.047602698, 8: 0.005219131, 9: 0.020195691, 10: 0.013766486, 11: 0.019271016, 12: -0.009086866, 13: 0.005213358, 14: -0.003552202, 15: 0.001820855, 16: -0.039992723, 17: 0.041166976, 18: -0.013040119, 19: 0.048827692, 20: 0.044577227, 21: 
0.043492943, 22: 0.045104437, 23: 0.0399218, 24: 0.007934858, 25: 0.007980119, 26: 0.010593472, 27: 0.006390279, 28: 0.070277892, 29: 0.066889416, 30: 0.070485941, 31: 0.054907996, 32: -0.011559485, 33: -0.005583401, 34: -0.004725084, 35: -0.007089815, 36: 0.048823811, 37: 0.04574317, 38: 0.047201689, 39: 0.043995531, 40: 0.005234299, 41: 0.020211407, 42: 0.013766486, 43: 0.019405438, 44: -0.009034049, 45: 0.005200504, 46: -0.003552202, 47: 0.002061042, 48: -0.035258171, 49: 0.041424053, 50: -0.013317812, 51: 0.048629332, 52: 0.043972705, 53: 0.042581942, 54: 0.046299595, 55: 0.040028712, 56: 0.007931264, 57: 0.007980119, 58: 0.010624531, 59: 0.006616644, 60: 0.068992196, 61: 0.064455916, 62: 0.07226277, 63: 0.056393304}, 'Frame': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 0, 22: 0, 23: 0, 24: 0, 25: 0, 26: 0, 27: 0, 28: 0, 29: 0, 30: 0, 31: 0, 32: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 39: 1, 40: 1, 41: 1, 42: 1, 43: 1, 44: 1, 45: 1, 46: 1, 47: 1, 48: 1, 49: 1, 50: 1, 51: 1, 52: 1, 53: 1, 54: 1, 55: 1, 56: 1, 57: 1, 58: 1, 59: 1, 60: 1, 61: 1, 62: 1, 63: 1}}
df = pd.DataFrame(data_dict)
fig = px.scatter_3d(df,x="x", y="y", z="z", color="Fish", animation_frame="Frame", hover_data = ["BodyPart"],
range_x=[-0.25,0.25], range_y=[-0.15,0.15], range_z=[-0.15,0.15], color_continuous_scale = "rainbow")
fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
fig.show()
This seems related to the aspectratio in fig.layout.scene:
layout.Scene({
'aspectmode': 'auto',
'aspectratio': {'x': 1.7359689116422856, 'y': 0.9924641251101735, 'z':0.5804211635071164},
If you manually set x, y and z in the dict above to something specific, the flinching of the figure between animation frames seems to disappear.
I've tried:
fig.layout.scene.aspectratio = {'x':1, 'y':1, 'z':1}
fig.show()
And the results are promising. Give it a go on your end and let me know how it works out for you.
It also seems, as you've already discovered, to work best in tandem with setting defined ranges via range_x, range_y and range_z. Since your data sample is a bit limited, I've been messing around with px.data.gapminder().
Plot
Complete code
import plotly.express as px
df = px.data.gapminder()
# df
fig = px.scatter_3d(df, x = 'pop', y='lifeExp', z = 'gdpPercap', animation_frame='year',
range_x=[int(df['pop'].min()*0.5),int(df['pop'].max()*1.5)],
range_y=[int(df.lifeExp.min()*0.5),int(df.lifeExp.max()*1.5)],
range_z=[int(df['gdpPercap'].min()*0.5),int(df['gdpPercap'].max()*1.5)]
)
fig.layout.scene.aspectratio = {'x':1, 'y':1, 'z':1}
fig.show()

How to add lines with annotations to candlestick charts when some values are missing?

I'm trying to use Plotly to overlay a marker/line chart on top of my OHLC candle chart.
Code
import plotly.graph_objects as go
import pandas as pd
import numpy as np  # needed for the np.nan values in 'Pivot Price'
from datetime import datetime
df = pd.DataFrame(
{'index': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23,
24: 24},
'Date': {0: '2018-09-03',
1: '2018-09-04',
2: '2018-09-05',
3: '2018-09-06',
4: '2018-09-07',
5: '2018-09-10',
6: '2018-09-11',
7: '2018-09-12',
8: '2018-09-13',
9: '2018-09-14',
10: '2018-09-17',
11: '2018-09-18',
12: '2018-09-19',
13: '2018-09-20',
14: '2018-09-21',
15: '2018-09-24',
16: '2018-09-25',
17: '2018-09-26',
18: '2018-09-27',
19: '2018-09-28',
20: '2018-10-01',
21: '2018-10-02',
22: '2018-10-03',
23: '2018-10-04',
24: '2018-10-05'},
'Open': {0: 1.2922067642211914,
1: 1.2867859601974487,
2: 1.2859420776367188,
3: 1.2914056777954102,
4: 1.2928247451782229,
5: 1.292808175086975,
6: 1.3027958869934082,
7: 1.3017443418502808,
8: 1.30451238155365,
9: 1.3110626935958862,
10: 1.3071041107177734,
11: 1.3146650791168213,
12: 1.3166556358337402,
13: 1.3140604496002195,
14: 1.3271400928497314,
15: 1.3080958127975464,
16: 1.3117163181304932,
17: 1.3180439472198486,
18: 1.3169677257537842,
19: 1.3077707290649414,
20: 1.3039510250091553,
21: 1.3043931722640991,
22: 1.2979763746261597,
23: 1.2941633462905884,
24: 1.3022021055221558},
'High': {0: 1.2934937477111816,
1: 1.2870012521743774,
2: 1.2979259490966797,
3: 1.2959914207458496,
4: 1.3024225234985352,
5: 1.3052103519439695,
6: 1.30804443359375,
7: 1.3044441938400269,
8: 1.3120088577270508,
9: 1.3143367767333984,
10: 1.3156682252883911,
11: 1.3171066045761108,
12: 1.3211784362792969,
13: 1.3296104669570925,
14: 1.3278449773788452,
15: 1.3166556358337402,
16: 1.3175750970840454,
17: 1.3196094036102295,
18: 1.3180439472198486,
19: 1.3090718984603882,
20: 1.3097577095031738,
21: 1.3049719333648682,
22: 1.3020155429840088,
23: 1.3036959171295166,
24: 1.310753345489502},
'Low': {0: 1.2856279611587524,
1: 1.2813942432403564,
2: 1.2793285846710205,
3: 1.289723515510559,
4: 1.2918561697006226,
5: 1.289823293685913,
6: 1.2976733446121216,
7: 1.298414707183838,
8: 1.3027619123458862,
9: 1.3073604106903076,
10: 1.3070186376571655,
11: 1.3120776414871216,
12: 1.3120431900024414,
13: 1.3140085935592651,
14: 1.305841088294983,
15: 1.3064552545547483,
16: 1.3097233772277832,
17: 1.3141123056411743,
18: 1.309706211090088,
19: 1.3002548217773438,
20: 1.3014055490493774,
21: 1.2944146394729614,
22: 1.2964619398117063,
23: 1.2924572229385376,
24: 1.3005592823028564},
'Close': {0: 1.292306900024414,
1: 1.2869019508361816,
2: 1.2858428955078125,
3: 1.2914891242980957,
4: 1.2925406694412231,
5: 1.2930254936218262,
6: 1.302643060684204,
7: 1.3015578985214231,
8: 1.304546356201172,
9: 1.311131477355957,
10: 1.307326316833496,
11: 1.3146305084228516,
12: 1.3168463706970217,
13: 1.3141123056411743,
14: 1.327087163925171,
15: 1.30804443359375,
16: 1.3117333650588991,
17: 1.3179919719696045,
18: 1.3172800540924072,
19: 1.3078734874725342,
20: 1.3039000034332275,
21: 1.3043591976165771,
22: 1.2981956005096436,
23: 1.294062852859497,
24: 1.3024225234985352},
'Pivot Price': {0: 1.2934937477111816,
1: np.nan,
2: 1.2793285846710205,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: 1.3296104669570925,
14: np.nan,
15: np.nan,
16: np.nan,
17: np.nan,
18: np.nan,
19: np.nan,
20: np.nan,
21: np.nan,
22: np.nan,
23: 1.2924572229385376,
24: np.nan}})
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['Open'],
high=df['High'],
low=df['Low'],
close=df['Close'])])
fig.add_trace(
go.Scatter(mode = "lines+markers",
x=df['Date'],
y=df["Pivot Price"]
))
fig.update_layout(
autosize=False,
width=1000,
height=800,)
fig.show()
This is the current image
This is the desired output/image
I want a black line between the markers (pivots). Ideally I would also like a value next to each line showing the distance between consecutive pivots, but I'm not sure how to do this.
For example, the distance between the first two pivots, round(abs(1.293494 - 1.279329),3), returns 0.014, so I would ideally like this value next to the line.
The second is round(abs(1.279329 - 1.329610),3), so the value would be 0.05. I have hand-edited the image and added the lines for the first two values to give a visual representation of what I'm trying to achieve.
The problem seems to be the missing values. So just use pandas.Series.interpolate in combination with fig.add_annotation to get:
I've included annotations for differences as well. There are surely more elegant ways to do it than with for loops, but it does the job. Let me know if anything is unclear!
import pandas as pd
import numpy as np
import plotly.graph_objects as go
df = pd.DataFrame(
{'index': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23,
24: 24},
'Date': {0: '2018-09-03',
1: '2018-09-04',
2: '2018-09-05',
3: '2018-09-06',
4: '2018-09-07',
5: '2018-09-10',
6: '2018-09-11',
7: '2018-09-12',
8: '2018-09-13',
9: '2018-09-14',
10: '2018-09-17',
11: '2018-09-18',
12: '2018-09-19',
13: '2018-09-20',
14: '2018-09-21',
15: '2018-09-24',
16: '2018-09-25',
17: '2018-09-26',
18: '2018-09-27',
19: '2018-09-28',
20: '2018-10-01',
21: '2018-10-02',
22: '2018-10-03',
23: '2018-10-04',
24: '2018-10-05'},
'Open': {0: 1.2922067642211914,
1: 1.2867859601974487,
2: 1.2859420776367188,
3: 1.2914056777954102,
4: 1.2928247451782229,
5: 1.292808175086975,
6: 1.3027958869934082,
7: 1.3017443418502808,
8: 1.30451238155365,
9: 1.3110626935958862,
10: 1.3071041107177734,
11: 1.3146650791168213,
12: 1.3166556358337402,
13: 1.3140604496002195,
14: 1.3271400928497314,
15: 1.3080958127975464,
16: 1.3117163181304932,
17: 1.3180439472198486,
18: 1.3169677257537842,
19: 1.3077707290649414,
20: 1.3039510250091553,
21: 1.3043931722640991,
22: 1.2979763746261597,
23: 1.2941633462905884,
24: 1.3022021055221558},
'High': {0: 1.2934937477111816,
1: 1.2870012521743774,
2: 1.2979259490966797,
3: 1.2959914207458496,
4: 1.3024225234985352,
5: 1.3052103519439695,
6: 1.30804443359375,
7: 1.3044441938400269,
8: 1.3120088577270508,
9: 1.3143367767333984,
10: 1.3156682252883911,
11: 1.3171066045761108,
12: 1.3211784362792969,
13: 1.3296104669570925,
14: 1.3278449773788452,
15: 1.3166556358337402,
16: 1.3175750970840454,
17: 1.3196094036102295,
18: 1.3180439472198486,
19: 1.3090718984603882,
20: 1.3097577095031738,
21: 1.3049719333648682,
22: 1.3020155429840088,
23: 1.3036959171295166,
24: 1.310753345489502},
'Low': {0: 1.2856279611587524,
1: 1.2813942432403564,
2: 1.2793285846710205,
3: 1.289723515510559,
4: 1.2918561697006226,
5: 1.289823293685913,
6: 1.2976733446121216,
7: 1.298414707183838,
8: 1.3027619123458862,
9: 1.3073604106903076,
10: 1.3070186376571655,
11: 1.3120776414871216,
12: 1.3120431900024414,
13: 1.3140085935592651,
14: 1.305841088294983,
15: 1.3064552545547483,
16: 1.3097233772277832,
17: 1.3141123056411743,
18: 1.309706211090088,
19: 1.3002548217773438,
20: 1.3014055490493774,
21: 1.2944146394729614,
22: 1.2964619398117063,
23: 1.2924572229385376,
24: 1.3005592823028564},
'Close': {0: 1.292306900024414,
1: 1.2869019508361816,
2: 1.2858428955078125,
3: 1.2914891242980957,
4: 1.2925406694412231,
5: 1.2930254936218262,
6: 1.302643060684204,
7: 1.3015578985214231,
8: 1.304546356201172,
9: 1.311131477355957,
10: 1.307326316833496,
11: 1.3146305084228516,
12: 1.3168463706970217,
13: 1.3141123056411743,
14: 1.327087163925171,
15: 1.30804443359375,
16: 1.3117333650588991,
17: 1.3179919719696045,
18: 1.3172800540924072,
19: 1.3078734874725342,
20: 1.3039000034332275,
21: 1.3043591976165771,
22: 1.2981956005096436,
23: 1.294062852859497,
24: 1.3024225234985352},
'Pivot Price': {0: 1.2934937477111816,
1: np.nan,
2: 1.2793285846710205,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: 1.3296104669570925,
14: np.nan,
15: np.nan,
16: np.nan,
17: np.nan,
18: np.nan,
19: np.nan,
20: np.nan,
21: np.nan,
22: np.nan,
23: 1.2924572229385376,
24: np.nan}})
import plotly.graph_objects as go
import pandas as pd
from datetime import datetime
# df=pd.read_csv("for_so.csv")
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
# fig = go.Figure(data=[go.Candlestick(x=df.index,
open=df['Open'],
high=df['High'],
low=df['Low'],
close=df['Close'])])
# some calculations
df_diff = df['Pivot Price'].dropna().diff().copy()
df2 = df[df.index.isin(df_diff.index)].copy()
df2['Price Diff'] = df['Pivot Price'].dropna().values
fig.add_trace(
go.Scatter(mode = "lines+markers",
x=df['Date'],
y=df["Pivot Price"]
))
fig.update_layout(
autosize=False,
width=1000,
height=800,)
fig.add_trace(go.Scatter(x=df['Date'], y=df['Pivot Price'].interpolate(),
# fig.add_trace(go.Scatter(x=df.index, y=df['Pivot Price'].interpolate(),
mode = 'lines',
line = dict(color='black')))
def annot(value):
    # print(type(value))
    if np.isnan(value):
        return ''
    else:
        return value
j = 0
for i, p in enumerate(df['Pivot Price']):
    # print(p)
    # if not np.isnan(p) and not np.isnan(df_diff.iloc[j]):
    if not np.isnan(p):
        # print(not np.isnan(df_diff.iloc[j]))
        fig.add_annotation(dict(font=dict(color='rgba(0,0,200,0.8)',size=12),
                                x=df['Date'].iloc[i],
                                # x=df.index[i],
                                # x = xStart
                                y=p,
                                showarrow=False,
                                text=annot(round(abs(df_diff.iloc[j]),3)),
                                textangle=0,
                                xanchor='right',
                                xref="x",
                                yref="y"))
        j = j + 1
fig.update_xaxes(type='category')
fig.show()
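The two data steps the answer above relies on can be checked in isolation, without any plotting. This is a small sketch on a shortened, hypothetical pivot series (three pivot values taken from the question, with fewer gaps between them): `interpolate` fills the gaps so a continuous line can be drawn, and `dropna().diff()` gives the distance between consecutive pivots.

```python
import numpy as np
import pandas as pd

# Shortened pivot column: real pivot values from the question, fewer gaps
pivots = pd.Series([1.2934937, np.nan, 1.2793286, np.nan, np.nan, 1.3296105])

# Linear interpolation fills the NaNs, giving y-values for an unbroken line
line = pivots.interpolate()

# Absolute distance between consecutive pivots, rounded as in the question
diffs = pivots.dropna().diff().abs().round(3)

print(line.tolist())
print(diffs)
```

The first entry of `diffs` is NaN because the first pivot has no predecessor, which is why the answer maps NaN to an empty annotation.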
The problem seems to be the missing values, which Plotly has difficulty with. With this trick you plot the line only through the points that have a value:
import plotly.graph_objects as go
import pandas as pd
from datetime import datetime
df=pd.read_csv("notebooks/for_so.csv")
# mask of the rows that actually have a pivot value
has_value = ~df["Pivot Price"].isna()
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['Open'],
high=df['High'],
low=df['Low'],
close=df['Close'])])
fig.add_trace(
go.Scatter(mode = 'lines',
x=df[has_value]['Date'],
y=df[has_value]["Pivot Price"], line={'color':'black', 'width':1}
))
fig.add_trace(
go.Scatter(mode = "markers",
x=df['Date'],
y=df["Pivot Price"]
))
fig.update_layout(
autosize=False,
width=1000,
height=800,)
fig.show()
This did it for me.

How to remove duplicates based on lower frequency [duplicate]

This question already has answers here:
Get the row(s) which have the max value in groups using groupby
(15 answers)
Closed 2 years ago.
I have a table that looks like this
I want to keep, for each brand, only the id with the highest freq. For example, for audi both ids have the same frequency, so keep only one. For mercedes-benz, keep the latter one since it has frequency 7.
This is my dataframe:
{'Brand':
{0: 'audi',
1: 'audi',
2: 'bmw',
3: 'dacia',
4: 'fiat',
5: 'ford',
6: 'ford',
7: 'honda',
8: 'honda',
9: 'hyundai',
10: 'kia',
11: 'mercedes-benz',
12: 'mercedes-benz',
13: 'nissan',
14: 'nissan',
15: 'opel',
16: 'renault',
17: 'renault',
18: 'renault',
19: 'renault',
20: 'toyota',
21: 'toyota',
22: 'volvo',
23: 'vw',
24: 'vw',
25: 'vw',
26: 'vw'},
'id':
{0: 'audi_a4_dynamic_2016_otomatik',
1: 'audi_a6_standart_2015_otomatik',
2: 'bmw_5 series_executive_2016_otomatik',
3: 'dacia_duster_laureate_2017_manuel',
4: 'fiat_egea_easy_2017_manuel',
5: 'ford_focus_trend x_2015_manuel',
6: 'ford_focus_trend x_2015_otomatik',
7: 'honda_civic_eco elegance_2017_otomatik',
8: 'honda_cr-v_executive_2018_otomatik',
9: 'hyundai_tucson_elite plus_2017_otomatik',
10: 'kia_sportage_concept plus_2015_otomatik',
11: 'mercedes-benz_c-class_amg_2016_otomatik',
12: 'mercedes-benz_e-class_edition e_2015_otomatik',
13: 'nissan_qashqai_black edition_2014_manuel',
14: 'nissan_qashqai_sky pack_2015_otomatik',
15: 'opel_astra_edition_2016_manuel',
16: 'renault_clio_joy_2016_manuel',
17: 'renault_kadjar_icon_2015_otomatik',
18: 'renault_kadjar_icon_2016_otomatik',
19: 'renault_mégane_touch_2017_otomatik',
20: 'toyota_corolla_touch_2015_otomatik',
21: 'toyota_corolla_touch_2016_otomatik',
22: 'volvo_s60_advance_2018_otomatik',
23: 'vw_jetta_comfortline_2013_otomatik',
24: 'vw_passat_highline_2017_otomatik',
25: 'vw_tiguan_sport&style_2012_manuel',
26: 'vw_tiguan_sport&style_2013_manuel'},
'freq': {0: 4,
1: 4,
2: 7,
3: 4,
4: 4,
5: 4,
6: 4,
7: 4,
8: 4,
9: 4,
10: 4,
11: 4,
12: 7,
13: 4,
14: 4,
15: 4,
16: 4,
17: 4,
18: 4,
19: 4,
20: 4,
21: 4,
22: 4,
23: 4,
24: 7,
25: 4,
26: 4}}
Edit: tried one of the answers and got an extra level of header
You need to group by Brand with pandas groupby and then aggregate with respect to the maximal frequency.
Something like this should work:
df.groupby('Brand')[['id', 'freq']].agg({'freq': 'max'})
To get your result, run:
result = df.groupby('Brand', as_index=False).apply(
lambda grp: grp[grp.freq == grp.freq.max()].iloc[0])
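As an alternative sketch (not taken from either answer above), `idxmax` can pick the row label of the highest `freq` per brand, and `df.loc` then selects those rows. This keeps the `id` column and returns ordinary flat columns. On ties `idxmax` returns the first occurrence, which matches the "keep only one" requirement. The frame below is a five-row subset of the question's data.

```python
import pandas as pd

# Subset of the question's data: one tie (audi) and one clear winner (mercedes-benz)
df = pd.DataFrame({
    "Brand": ["audi", "audi", "bmw", "mercedes-benz", "mercedes-benz"],
    "id": [
        "audi_a4_dynamic_2016_otomatik",
        "audi_a6_standart_2015_otomatik",
        "bmw_5 series_executive_2016_otomatik",
        "mercedes-benz_c-class_amg_2016_otomatik",
        "mercedes-benz_e-class_edition e_2015_otomatik",
    ],
    "freq": [4, 4, 7, 4, 7],
})

# Row label of the maximum freq within each brand (first one wins on ties)
best_rows = df.groupby("Brand")["freq"].idxmax()

# Select those rows; the columns stay flat: Brand, id, freq
result = df.loc[best_rows].reset_index(drop=True)
print(result)
```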

Nested for loops to create multiple pivot tables based on a 2-level MultiIndex in pandas

Started getting confused with this one. I have a large Fact Invoice Header table. I took the original dataframe, used a groupby to split the df up based upon one column. The output was a list of dataframes:
list_of_dfs = []
for _, g in df.groupby(df['Project State Name']):
    list_of_dfs.append(g)
list_of_dfs
Then I used another for loop to loop through the list of dataframes and perform one pivot table aggregation.
for each_state_df in list_of_dfs:
    columns_to_index_by = ['Project Issue', 'Project Secondary Issue', 'Project Client Name']
    # Aggregating to the Project Level
    table_for_pivots = pd.pivot_table(df, index=['FY Year', 'Project Issue'], values=["Project Key", 'Total Net Amount', "Project Total Resolution Amount", 'Project Budgeted Amount'],
                                      aggfunc= {"Project Key": lambda x: len(x.unique()), 'Total Net Amount': np.sum, "Project Total Resolution Amount": np.mean,
                                                'Project Budgeted Amount': np.mean},
                                      fill_value=np.mean)
    print(table_for_pivots)
My question is, how can I use another for loop to replace the second element in the pivot table index with each value in the variable columns_to_index_by? The output would be 3 pivot tables where index=['FY Year', 'Project Issue'], index=['FY Year', 'Project Secondary Issue'], and index=['FY Year', 'Project Client Name']. Thanks all!
Link to download a sample df data is here:
https://ufile.io/iufv9nma
Use list comprehension and iterate through a zip of the index you want to set for each group:
import numpy as np
import pandas as pd
from pandas import Timestamp
from numpy import nan
d = {'Total Net Amount': {2: 672.0, 41: 1277.9, 17: 270.0, 32: 845.3, 26: 828.62, 11: 733.5, 23: 1741.8, 35: 254.14655, 29: 245.0, 59: 215.0, 38: 617.4, 0: 1061.5}, 'Project Total Resolution Amount': {2: 35000, 41: 27000, 17: 40000, 32: 27000, 26: 27000, 11: 40000, 23: 27000, 35: 27000, 29: 27000, 59: 27000, 38: 27000, 0: 30000}, 'Invoice Header Key': {2: 1229422, 41: 984803, 17: 1270731, 32: 938069, 26: 911535, 11: 1247443, 23: 902150, 35: 943737, 29: 918888, 59: 1071541, 38: 965091, 0: 1279581}, 'Project Key': {2: 259661, 41: 194517, 17: 259188, 32: 194517, 26: 194517, 11: 259188, 23: 194517, 35: 194517, 29: 194517, 59: 194517, 38: 194517, 0: 263736}, 'Project Secondary Issue': {2: 2, 41: 4, 17: 0, 32: 3, 26: 3, 11: 0, 23: 4, 35: 4, 29: 4, 59: 4, 38: 3, 0: 4}, 'Organization Key': {2: 16029, 41: 22638, 17: 24230, 32: 22638, 26: 22638, 11: 24230, 23: 22638, 35: 22638, 29: 22638, 59: 22638, 38: 22638, 0: 4532}, 'Project Budgeted Amount': {2: 42735.0, 41: 32500.0, 17: 26000.0, 32: 32500.0, 26: 32500.0, 11: 26000.0, 23: 32500.0, 35: 32500.0, 29: 32500.0, 59: 32500.0, 38: 32500.0, 0: nan}, 'Project State Name': {2: 0, 41: 1, 17: 2, 32: 1, 26: 1, 11: 2, 23: 1, 35: 1, 29: 1, 59: 1, 38: 1, 0: 1}, 'Project Issue': {2: 0, 41: 2, 17: 1, 32: 2, 26: 2, 11: 1, 23: 2, 35: 2, 29: 2, 59: 2, 38: 2, 0: 1}, 'Project Number': {2: 2, 41: 0, 17: 1, 32: 0, 26: 0, 11: 1, 23: 0, 35: 0, 29: 0, 59: 0, 38: 0, 0: 3}, 'Project Client Name': {2: 1, 41: 0, 17: 0, 32: 0, 26: 0, 11: 0, 23: 0, 35: 0, 29: 0, 59: 0, 38: 0, 0: 1}, 'Paid Date Year Month': {2: 13, 41: 7, 17: 15, 32: 4, 26: 2, 11: 14, 23: 1, 35: 5, 29: 3, 59: 12, 38: 6, 0: 16}, 'FY Year': {2: 2, 41: 0, 17: 2, 32: 0, 26: 0, 11: 2, 23: 0, 35: 0, 29: 0, 59: 1, 38: 0, 0: 2}, 'Invoice Paid Date': {2: Timestamp('2019-09-10 00:00:00'), 41: Timestamp('2017-12-20 00:00:00'), 17: Timestamp('2019-11-25 00:00:00'), 32: Timestamp('2017-08-31 00:00:00'), 26: Timestamp('2017-06-14 00:00:00'), 11: Timestamp('2019-10-08 00:00:00'), 23: 
Timestamp('2017-05-30 00:00:00'), 35: Timestamp('2017-09-07 00:00:00'), 29: Timestamp('2017-07-10 00:00:00'), 59: Timestamp('2018-10-03 00:00:00'), 38: Timestamp('2017-11-03 00:00:00'), 0: Timestamp('2019-12-12 00:00:00')}, 'Invoice Paid Date Key': {2: 20190910, 41: 20171220, 17: 20191125, 32: 20170831, 26: 20170614, 11: 20191008, 23: 20170530, 35: 20170907, 29: 20170710, 59: 20181003, 38: 20171103, 0: 20191212}, 'Count Project Secondary Issue': {2: 3, 41: 3, 17: 3, 32: 3, 26: 3, 11: 3, 23: 3, 35: 3, 29: 3, 59: 3, 38: 3, 0: 2}, 'Total Net Amount By Count Project Secondary Issue': {2: 224.0, 41: 425.9666666666667, 17: 90.0, 32: 281.7666666666667, 26: 276.2066666666666, 11: 244.5, 23: 580.6, 35: 84.71551666666666, 29: 81.66666666666667, 59: 71.66666666666667, 38: 205.8, 0: 530.75}, 'Total Net Invoice Amount': {2: 672.0, 41: 1277.9, 17: 270.0, 32: 845.3, 26: 828.62, 11: 733.5, 23: 1741.8, 35: 254.14655, 29: 245.0, 59: 215.0, 38: 617.4, 0: 1061.5}, 'Total Project Invoice Amount': {2: 7176.52, 41: 10110.98655, 17: 1678.5, 32: 10110.98655, 26: 10110.98655, 11: 1678.5, 23: 10110.98655, 35: 10110.98655, 29: 10110.98655, 59: 10110.98655, 38: 10110.98655, 0: 1061.5}, 'Invoice Dollar Percent of Project': {2: 0.09363869953682286, 41: 0.1263872712796755, 17: 0.160857908847185, 32: 0.08360212881501655, 26: 0.08195243816242638, 11: 0.4369973190348526, 23: 0.1722680562758735, 35: 0.02513568272919916, 29: 0.02423106773888449, 59: 0.02126399821983741, 38: 0.06106229070198891, 0: 1.0}}
df = pd.DataFrame(d)
# list comprehension with groupby
group = [g for _, g in df.groupby('Project State Name')]
#create a list of indices you want to use in pivot
idx = [['FY Year', 'Project Issue'],
['FY Year', 'Project Secondary Issue'],
['FY Year', 'Project Client Name']]
# create a list of columns to add to the value param in pivot
values = ["Project Key", 'Total Net Amount',
"Project Total Resolution Amount", 'Project Budgeted Amount']
# use your current pivot and iterate through zip(idx, group)
dfs = [pd.pivot_table(g, index=i, values=values,
                      aggfunc={"Project Key": lambda x: len(x.unique()),
                               'Total Net Amount': np.sum,
                               "Project Total Resolution Amount": np.mean,
                               'Project Budgeted Amount': np.mean},
                      fill_value=np.mean)
       for i, g in zip(idx, group)]
Dict comprehension
I did not know what you wanted the keys to be, so I used the second value from each entry in idx. You can then look up each dataframe in the dict by that key, for example dfs['Project Issue'].
dfs = {i[1]: pd.pivot_table(g, index=i, values=values,
                            aggfunc={"Project Key": lambda x: len(x.unique()),
                                     'Total Net Amount': np.sum,
                                     "Project Total Resolution Amount": np.mean,
                                     'Project Budgeted Amount': np.mean},
                            fill_value=np.mean)
       for i, g in zip(idx, group)}
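One detail of the comprehensions above is easy to miss: zip pairs each index list with one group, so every pivot is built from a different slice of the data. A minimal sketch on made-up data (the toy frame and its column names state, year, issue and amount are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical toy data standing in for the invoice dataframe.
toy = pd.DataFrame({
    "state": [0, 0, 1, 1],
    "year": [2017, 2018, 2017, 2017],
    "issue": ["a", "b", "a", "a"],
    "amount": [10.0, 20.0, 30.0, 40.0],
})

# One sub-dataframe per value of "state", in sorted key order.
groups = [g for _, g in toy.groupby("state")]

# One list of index columns per group; zip stops at the shorter list.
index_cols = [["year", "issue"], ["year"]]

# Key each pivot by the tuple of index columns it was built with.
pivots = {
    tuple(cols): pd.pivot_table(g, index=cols, values="amount", aggfunc="sum")
    for cols, g in zip(index_cols, groups)
}

for key, table in pivots.items():
    print(key)
    print(table)
```

If the intent is instead to build every pivot for every group, a nested comprehension over both lists would be needed rather than zip.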

group by and calculate auc on folds

What I would like to do, based on the dataset below, is to calculate the AUC for each algorithm and also later for each dataset. I have tried something like this but it is not working:
from sklearn.metrics import roc_auc_score,roc_curve,scorer
import pandas as pd
test = pd.DataFrame(dico)
def auc_group(y_hat, y):
    return roc_auc_score(y_hat, y)
test.groupby(["Dataset", "Algo"]).apply(auc_group)
Later I would like to do the same operation on the folds of a KFold split, which will just be another layer of groupby:
from sklearn.metrics import roc_auc_score,roc_curve,scorer
import pandas as pd
test = pd.DataFrame(dico)
def auc_group(y_hat, y):
    return roc_auc_score(y_hat, y)
test.groupby(["Dataset", "Algo", "Folds"]).apply(auc_group)
And here is the data
dico = {'Dataset': {0: 'UCI',
1: 'UCI',
2: 'UCI',
3: 'UCI',
4: 'UCI',
5: 'UCI',
6: 'UCI',
7: 'UCI',
8: 'UCI',
9: 'UCI',
10: 'UCI',
11: 'UCI',
12: 'UCI',
13: 'UCI',
14: 'UCI',
15: 'UCI',
16: 'UCI',
17: 'UCI',
18: 'UCI',
19: 'UCI',
20: 'UCI',
21: 'UCI',
22: 'UCI',
23: 'UCI',
24: 'UCI',
25: 'UCI',
26: 'UCI',
27: 'UCI',
28: 'UCI',
29: 'UCI',
30: 'UCI',
31: 'UCI',
32: 'UCI',
33: 'UCI',
34: 'UCI',
35: 'UCI',
36: 'UCI',
37: 'UCI',
38: 'UCI',
39: 'UCI'},
'Algo': {0: 'Gnb',
1: 'Gnb',
2: 'Gnb',
3: 'Gnb',
4: 'Gnb',
5: 'Gnb',
6: 'Gnb',
7: 'Gnb',
8: 'Gnb',
9: 'Gnb',
10: 'Gnb',
11: 'Gnb',
12: 'Gnb',
13: 'Gnb',
14: 'Gnb',
15: 'Gnb',
16: 'Gnb',
17: 'Gnb',
18: 'Gnb',
19: 'Gnb',
20: 'LR',
21: 'LR',
22: 'LR',
23: 'LR',
24: 'LR',
25: 'LR',
26: 'LR',
27: 'LR',
28: 'LR',
29: 'LR',
30: 'LR',
31: 'LR',
32: 'LR',
33: 'LR',
34: 'LR',
35: 'LR',
36: 'LR',
37: 'LR',
38: 'LR',
39: 'LR'},
'p(y=1)': {0: 0.008566693461697914,
1: 0.023329740200720657,
2: 0.013079244223084688,
3: 0.0035655899487093525,
4: 0.5412516864202239,
5: 0.02437104068449619,
6: 0.0015772504872503706,
7: 0.01976775149918856,
8: 0.02580128697308947,
9: 0.052349648267671536,
10: 0.016115492810474592,
11: 0.028573206085476182,
12: 0.9975288953422592,
13: 0.1281394485094793,
14: 0.0014564219132441555,
15: 0.015625393606472308,
16: 0.15181450609384148,
17: 0.015221143650194884,
18: 0.022419878846782183,
19: 0.9991431483286071,
20: 0.04281920675218464,
21: 0.035985853029231185,
22: 0.05570563548576814,
23: 0.5468626213371839,
24: 0.01616233084557819,
25: 0.025090866736312712,
26: 0.4368789472788432,
27: 0.5268969392335681,
28: 0.06716466142340655,
29: 0.2093170587100108,
30: 0.008660602880515709,
31: 0.10929145816022637,
32: 0.04069088617214272,
33: 0.06683143493934368,
34: 0.06653318086395299,
35: 0.016010358473692744,
36: 0.08583523793056999,
37: 0.044347932186208014,
38: 0.014208157887412804,
39: 0.007949785472510792},
'y_hat': {0: 0,
1: 0,
2: 0,
3: 0,
4: 1,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 1,
13: 0,
14: 0,
15: 0,
16: 0,
17: 0,
18: 0,
19: 1,
20: 0,
21: 0,
22: 0,
23: 1,
24: 0,
25: 0,
26: 0,
27: 1,
28: 0,
29: 0,
30: 0,
31: 0,
32: 0,
33: 0,
34: 0,
35: 0,
36: 0,
37: 0,
38: 0,
39: 0},
'y': {0: 0,
1: 0,
2: 0,
3: 0,
4: 0,
5: 0,
6: 0,
7: 0,
8: 0,
9: 0,
10: 0,
11: 0,
12: 1,
13: 1,
14: 0,
15: 0,
16: 0,
17: 0,
18: 0,
19: 1,
20: 0,
21: 0,
22: 0,
23: 1,
24: 0,
25: 0,
26: 0,
27: 0,
28: 0,
29: 0,
30: 0,
31: 0,
32: 0,
33: 0,
34: 0,
35: 0,
36: 0,
37: 0,
38: 0,
39: 0}}
Here is the error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
724 try:
--> 725 result = self._python_apply_general(f)
726 except Exception:
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
741 def _python_apply_general(self, f):
--> 742 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
743
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
236 group_axes = _get_axes(group)
--> 237 res = f(group)
238 if not _is_indexed_like(res, group_axes):
TypeError: auc_group() missing 1 required positional argument: 'y'
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
<ipython-input-23-eab997668f67> in <module>
2 return roc_auc_score(y_hat, y)
3
----> 4 test.groupby(["Dataset", "Algo"]).apply(auc_group)
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
735
736 with _group_selection_context(self):
--> 737 return self._python_apply_general(f)
738
739 return result
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
740
741 def _python_apply_general(self, f):
--> 742 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
743
744 return self._wrap_applied_output(
/opt/conda/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
235 # group might be modified
236 group_axes = _get_axes(group)
--> 237 res = f(group)
238 if not _is_indexed_like(res, group_axes):
239 mutated = True
TypeError: auc_group() missing 1 required positional argument: 'y'
The problem with your code is that groupby(...).apply calls auc_group with a single argument, the sub-dataframe for each group, not with separate columns. That is why the traceback complains about the missing argument 'y'. Changing auc_group to take the dataframe and pull out the columns itself should solve the problem:
def auc_group(df):
    y_hat = df.y_hat
    y = df.y
    return roc_auc_score(y_hat, y)
With this change and your data,
test.groupby(["Dataset", "Algo"]).apply(auc_group)
produces
Dataset  Algo
UCI      Gnb     0.803922
         LR      0.750000
dtype: float64
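One caveat worth flagging: roc_auc_score is documented as roc_auc_score(y_true, y_score), with the true labels first, and AUC is normally computed from the predicted probabilities rather than from the thresholded y_hat. A hedged sketch of that variant on made-up data (the helper name auc_from_scores and the toy frame are assumptions for illustration; the column names y and p(y=1) follow the question):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def auc_from_scores(group, label_col="y", score_col="p(y=1)"):
    """AUC for one group: true labels first, predicted scores second."""
    return roc_auc_score(group[label_col], group[score_col])

# Hypothetical toy data: two algorithms, four predictions each.
toy = pd.DataFrame({
    "Algo": ["A"] * 4 + ["B"] * 4,
    "y": [0, 0, 1, 1, 0, 1, 0, 1],
    "p(y=1)": [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.6],
})

# Selecting the needed columns first keeps the grouping column
# out of the frame that is handed to the function.
auc_by_algo = toy.groupby("Algo")[["y", "p(y=1)"]].apply(auc_from_scores)
print(auc_by_algo)
```

Each group must contain both classes in y, otherwise roc_auc_score cannot compute a value for it. This matters once the data is split further by fold.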
The key is knowing how to work with the grouped dataframe, and it is pretty straightforward: iterating over the DataFrameGroupBy object yields one (key, sub-dataframe) pair per group:
for _, grp in test.groupby(["Dataset", "Algo"]):
    print(grp.head())
    # do whatever you want to do with the grouped dataframe
Back to your question, using the two-argument auc_group from your post, we can write:
result = {}
for _, grp in test.groupby(["Dataset", "Algo"]):
    grp_flag = grp['Dataset'].iloc[0] + '_' + grp['Algo'].iloc[0]
    auc = auc_group(grp['y_hat'], grp['y'])
    result[grp_flag] = auc
Then you will get:
result
{'UCI_Gnb': 0.8039215686274509, 'UCI_LR': 0.75}
For your later operation, add "Folds" as a third entry in the groupby list and append the fold number to grp_flag:
result = {}
for _, grp in test.groupby(["Dataset", "Algo", "Folds"]):
    grp_flag = grp['Dataset'].iloc[0] + '_' + grp['Algo'].iloc[0] + '_' + str(grp['Folds'].iloc[0])
    auc = auc_group(grp['y_hat'], grp['y'])
    result[grp_flag] = auc
The AUC values end up in the result dictionary.
Note the str() around grp['Folds'].iloc[0]: without it, building grp_flag fails with TypeError: must be str, not numpy.int64, because the fold number is an integer and cannot be concatenated to a string directly.
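Since iterating over a groupby already yields the group key, the label can also be built from that key instead of from iloc[0]. A small sketch, assuming a Folds column exists (it is not in the sample data shown, so the toy frame here is made up, and the per-group mean of y is only a placeholder for the real metric):

```python
import pandas as pd

# Hypothetical toy data with a fold number per row.
toy = pd.DataFrame({
    "Dataset": ["UCI"] * 4,
    "Algo": ["Gnb", "Gnb", "LR", "LR"],
    "Folds": [0, 1, 0, 1],
    "y": [0, 1, 1, 0],
})

result = {}
for key, grp in toy.groupby(["Dataset", "Algo", "Folds"]):
    # key is a tuple such as ("UCI", "Gnb", 0); str() handles the integer fold.
    flag = "_".join(str(part) for part in key)
    # Placeholder metric: swap in the AUC computation here.
    result[flag] = grp["y"].mean()

print(result)
```

Joining with str() on every part sidesteps the TypeError regardless of which key columns are numeric.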
