Annotated bubble chart from a dataframe - python

I have the following data frame
df_SEDL = pd.DataFrame({
'SE': {0: 'Bug Prediction', 1: 'Code Navigation & Understanding', 2: 'Code Similarity & Clone Detection', 3: 'Security', 4: 'Bug Prediction', 5: 'Code Navigation & Understanding', 6: 'Code Similarity & Clone Detection', 7: 'Security', 8: 'Bug Prediction', 9: 'Code Navigation & Understanding', 10: 'Code Similarity & Clone Detection', 11: 'Security', 12: 'Bug Prediction', 13: 'Code Navigation & Understanding', 14: 'Code Similarity & Clone Detection', 15: 'Security', 16: 'Bug Prediction', 17: 'Code Navigation & Understanding', 18: 'Code Similarity & Clone Detection', 19: 'Security', 20: 'Bug Prediction', 21: 'Code Navigation & Understanding', 22: 'Code Similarity & Clone Detection', 23: 'Security', 24: 'Bug Prediction', 25: 'Code Navigation & Understanding', 26: 'Code Similarity & Clone Detection', 27: 'Security', 28: 'Bug Prediction', 29: 'Code Navigation & Understanding', 30: 'Code Similarity & Clone Detection', 31: 'Security'},
'DL': {0: 'ANN', 1: 'ANN', 2: 'ANN', 3: 'ANN', 4: 'Autoencoder', 5: 'Autoencoder', 6: 'Autoencoder', 7: 'Autoencoder', 8: 'CNN', 9: 'CNN', 10: 'CNN', 11: 'CNN', 12: 'GNN', 13: 'GNN', 14: 'GNN', 15: 'GNN', 16: 'LSTM', 17: 'LSTM', 18: 'LSTM', 19: 'LSTM', 20: 'Other_DL', 21: 'Other_DL', 22: 'Other_DL', 23: 'Other_DL', 24: 'RNN', 25: 'RNN', 26: 'RNN', 27: 'RNN', 28: 'attention mechanism', 29: 'attention mechanism', 30: 'attention mechanism', 31: 'attention mechanism'},
'Count': {0: 2.0, 1: 5.0, 2: 3.0, 3: 1.0, 4: 0.0, 5: 11.0, 6: 6.0, 7: 1.0, 8: 1.0, 9: 9.0, 10: 4.0, 11: 5.0, 12: 0.0, 13: 3.0, 14: 3.0, 15: 1.0, 16: 3.0, 17: 17.0, 18: 9.0, 19: 5.0, 20: 1.0, 21: 3.0, 22: 1.0, 23: 2.0, 24: 1.0, 25: 8.0, 26: 4.0, 27: 3.0, 28: 2.0, 29: 16.0, 30: 4.0, 31: 1.0}
})
I'm trying to plot a bubble chart using the following simple code
fig = plt.figure(figsize= (10,8))
ax = fig.add_subplot(111)
ax.scatter(x="DL", y="SE", s="Count", data=df_SEDL,alpha = 0.7,c=df_SEDL.Count*5000)
#plt.margins(.4)
fig.autofmt_xdate()
plt.show()
but the result (first figure) is not what I want.
I need help producing a chart like the second figure:
the same X and Y axes as the first figure, but with bigger bubbles whose size and color vary with the count, and with the count printed inside each bubble.

The marker size s of scatter is given in points squared (points**2, i.e. the marker area), not in data units. So, if your markers are too small, scale the values you pass to s.
Here is an example:
s_scaling = 80
fig = plt.figure(figsize= (10,8))
ax = fig.add_subplot(111)
ax.scatter(x="DL", y="SE", s=df.Count*s_scaling, # scaling the size here
data=df, alpha=0.7, c=df.Count*5000)
Leading to:
If this still is too small, simply adapt the value in s_scaling to your liking.
Now if you want to add the count as text, you can loop over the rows of your df and add text to your axes:
for index, row in df.iterrows():
    ax.text(row.DL, row.SE, row.Count, ha='center', va='center')
plt.show()
To further style and position the text have a look at the available options.
Hope that helps!
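Putting the two steps above together, a minimal self-contained sketch could look like the following. The small four-row frame is a made-up stand-in for df_SEDL, and skipping zero counts, the colormap name and the scaling factor of 80 are illustrative assumptions rather than part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df_SEDL with the same three columns
df = pd.DataFrame({
    "SE": ["Bug Prediction", "Security", "Bug Prediction", "Security"],
    "DL": ["ANN", "ANN", "CNN", "CNN"],
    "Count": [2.0, 1.0, 0.0, 5.0],
})

s_scaling = 80                  # marker area (points**2) per unit of Count
nonzero = df[df["Count"] > 0]   # assumption: zero counts get no bubble and no label

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(nonzero["DL"], nonzero["SE"],
           s=nonzero["Count"] * s_scaling,   # bubble size follows the count
           c=nonzero["Count"], cmap="viridis", alpha=0.7)

# Write the count in the centre of each bubble
for row in nonzero.itertuples(index=False):
    ax.text(row.DL, row.SE, f"{row.Count:.0f}", ha="center", va="center")

ax.margins(0.3)  # leave room so the outer bubbles are not clipped
```

Call plt.show() or fig.savefig(...) afterwards to see the result; increase s_scaling if the numbers do not fit inside the bubbles.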

Related

hvplot.errorbars - linking the error bars to line/scatterplots

I have the following dataframe, containing the mean and standard deviations of data, as well as other descriptors.
{'Person': {0: 'Mark',
1: 'Mark',
2: 'Mark',
3: 'Mark',
4: 'Mark',
5: 'Mark',
6: 'Mark',
7: 'Mark',
8: 'Mark',
9: 'Mark',
10: 'Mark',
11: 'Mark',
12: 'Mark',
13: 'Mark',
14: 'Mark',
15: 'Mark',
16: 'Mark',
17: 'Mark',
18: 'Mark',
19: 'Mark',
20: 'Mark',
21: 'Mark',
22: 'John',
23: 'John',
24: 'John',
25: 'John',
26: 'John',
27: 'John',
28: 'John',
29: 'John',
30: 'John',
31: 'John',
32: 'John',
33: 'John',
34: 'John',
35: 'John',
36: 'John',
37: 'John',
38: 'John',
39: 'John',
40: 'John',
41: 'John',
42: 'John',
43: 'John'},
'Alcohol': {0: 'No',
1: 'No',
2: 'No',
3: 'No',
4: 'No',
5: 'No',
6: 'No',
7: 'No',
8: 'No',
9: 'No',
10: 'No',
11: 'Yes',
12: 'Yes',
13: 'Yes',
14: 'Yes',
15: 'Yes',
16: 'Yes',
17: 'Yes',
18: 'Yes',
19: 'Yes',
20: 'Yes',
21: 'Yes',
22: 'No',
23: 'No',
24: 'No',
25: 'No',
26: 'No',
27: 'No',
28: 'No',
29: 'No',
30: 'No',
31: 'No',
32: 'No',
33: 'Yes',
34: 'Yes',
35: 'Yes',
36: 'Yes',
37: 'Yes',
38: 'Yes',
39: 'Yes',
40: 'Yes',
41: 'Yes',
42: 'Yes',
43: 'Yes'},
'Product': {0: 'Orange',
1: 'Orange',
2: 'Orange',
3: 'Orange',
4: 'Orange',
5: 'Apple',
6: 'Apple',
7: 'Apple',
8: 'Apple',
9: 'Apple',
10: 'Apple',
11: 'Orange',
12: 'Orange',
13: 'Orange',
14: 'Orange',
15: 'Orange',
16: 'Apple',
17: 'Apple',
18: 'Apple',
19: 'Apple',
20: 'Apple',
21: 'Apple',
22: 'Orange',
23: 'Orange',
24: 'Orange',
25: 'Orange',
26: 'Orange',
27: 'Apple',
28: 'Apple',
29: 'Apple',
30: 'Apple',
31: 'Apple',
32: 'Apple',
33: 'Orange',
34: 'Orange',
35: 'Orange',
36: 'Orange',
37: 'Orange',
38: 'Apple',
39: 'Apple',
40: 'Apple',
41: 'Apple',
42: 'Apple',
43: 'Apple'},
'Concentration': {0: 0,
1: 10,
2: 20,
3: 30,
4: 40,
5: 0,
6: 10,
7: 20,
8: 30,
9: 40,
10: 50,
11: 0,
12: 10,
13: 20,
14: 30,
15: 40,
16: 0,
17: 10,
18: 20,
19: 30,
20: 40,
21: 50,
22: 0,
23: 10,
24: 20,
25: 30,
26: 40,
27: 0,
28: 10,
29: 20,
30: 30,
31: 40,
32: 50,
33: 0,
34: 10,
35: 20,
36: 30,
37: 40,
38: 0,
39: 10,
40: 20,
41: 30,
42: 40,
43: 50},
'Response': {0: 4,
1: 10,
2: 25,
3: 31,
4: 48,
5: 10,
6: 22,
7: 35,
8: 46,
9: 56,
10: 61,
11: 24,
12: 30,
13: 45,
14: 51,
15: 68,
16: 30,
17: 42,
18: 55,
19: 66,
20: 76,
21: 81,
22: 17,
23: 23,
24: 38,
25: 44,
26: 61,
27: 23,
28: 35,
29: 48,
30: 59,
31: 69,
32: 74,
33: 37,
34: 43,
35: 58,
36: 64,
37: 81,
38: 43,
39: 55,
40: 68,
41: 79,
42: 89,
43: 94},
'Response mean': {0: 4.333333333,
1: 15.0,
2: 24.33333333,
3: 35.33333333,
4: 45.33333333,
5: 12.33333333,
6: 24.66666667,
7: 34.33333333,
8: 45.0,
9: 57.66666667,
10: 55.66666667,
11: 24.33333333,
12: 35.0,
13: 44.33333333,
14: 55.33333333,
15: 65.33333333,
16: 32.33333333,
17: 44.66666667,
18: 54.33333333,
19: 65.0,
20: 77.66666667,
21: 75.66666667,
22: 17.33333333,
23: 28.0,
24: 37.33333333,
25: 48.33333333,
26: 58.33333333,
27: 25.33333333,
28: 37.66666667,
29: 47.33333333,
30: 58.0,
31: 70.66666667,
32: 68.66666667,
33: 37.33333333,
34: 48.0,
35: 57.33333333,
36: 68.33333333,
37: 78.33333333,
38: 45.33333333,
39: 57.66666667,
40: 67.33333333,
41: 78.0,
42: 90.66666667,
43: 88.66666667},
'Response SD': {0: 1.527525232,
1: 4.582575695,
2: 2.081665999,
3: 4.041451884,
4: 2.516611478,
5: 2.516611478,
6: 3.055050463,
7: 2.081665999,
8: 1.0,
9: 1.527525232,
10: 14.74222959,
11: 1.527525232,
12: 4.582575695,
13: 2.081665999,
14: 4.041451884,
15: 2.516611478,
16: 2.516611478,
17: 3.055050463,
18: 2.081665999,
19: 1.0,
20: 1.527525232,
21: 14.74222959,
22: 1.527525232,
23: 4.582575695,
24: 2.081665999,
25: 4.041451884,
26: 2.516611478,
27: 2.516611478,
28: 3.055050463,
29: 2.081665999,
30: 1.0,
31: 1.527525232,
32: 14.74222959,
33: 1.527525232,
34: 4.582575695,
35: 2.081665999,
36: 4.041451884,
37: 2.516611478,
38: 2.516611478,
39: 3.055050463,
40: 2.081665999,
41: 1.0,
42: 1.527525232,
43: 14.74222959}}
I've started using hvplot because of the interactivity it offers for exploring data and, from what I can tell, it's based on Bokeh. In the example code below, I can plot the mean values (using scatter to give me the glyphs) and the line through the glyphs (using line), overlaying them with the scatter * line code. However, as I need to plot the standard deviations, I'm using the errorbars plot too, which works great (note that I use scatter * line * error). However, when I click on the legend to remove certain data, the scatter and line are removed (note that I have muted_alpha=0 for the line and scatter, but there is no such option for the error bars), but the error bars stay on the plot. It's as though the scatter and line are 'linked' together, but the error bars aren't.
line = df2.hvplot.line(x='Concentration', y='Response mean', by='Product', groupby= ['Person', 'Alcohol'], muted_alpha=0)
scatter = df2.hvplot.scatter(x='Concentration', y='Response mean', by='Product', groupby=['Person', 'Alcohol'], marker='o', size=40, muted_alpha=0)
error = df2.hvplot.errorbars(x='Concentration', y='Response mean', yerr1='Response SD', by='Product', groupby=['Person', 'Alcohol'])
all_plots = scatter * line * error
all_plots
Can anyone help me to link the errorbars to its corresponding data, so when I click on the legend, the error bars, scatter and line are removed from the plot?
Thank you in advance!

Column Values still shown after .isin()

As requested, here is a minimal reproducible example of the issue: .isin() does not drop the values that are not in the selection, it just sets their counts to zero:
import os
import pandas as pd
df_example = pd.DataFrame({'Requesting as': {0: 'Employee', 1: 'Ex- Employee', 2: 'Employee', 3: 'Employee', 4: 'Ex-Employee', 5: 'Employee', 6: 'Employee', 7: 'Employee', 8: 'Ex-Employee', 9: 'Ex-Employee', 10: 'Employee', 11: 'Employee', 12: 'Ex-Employee', 13: 'Ex-Employee', 14: 'Employee', 15: 'Employee', 16: 'Employee', 17: 'Ex-Employee', 18: 'Employee', 19: 'Employee', 20: 'Ex-Employee', 21: 'Employee', 22: 'Employee', 23: 'Ex-Employee', 24: 'Employee', 25: 'Employee', 26: 'Ex-Employee', 27: 'Employee', 28: 'Employee', 29: 'Ex-Employee', 30: 'Employee', 31: 'Employee', 32: 'Ex-Employee', 33: 'Employee', 34: 'Employee', 35: 'Ex-Employee', 36: 'Employee', 37: 'Employee', 38: 'Ex-Employee', 39: 'Employee', 40: 'Employee'}, 'Years of service': {0: -0.4, 1: -0.3, 2: -0.2, 3: 1.0, 4: 1.0, 5: 1.0, 6: 2.0, 7: 2.0, 8: 2.0, 9: 2.0, 10: 3.0, 11: 3.0, 12: 3.0, 13: 4.0, 14: 4.0, 15: 4.0, 16: 5.0, 17: 5.0, 18: 5.0, 19: 5.0, 20: 6.0, 21: 6.0, 22: 6.0, 23: 11.0, 24: 11.0, 25: 11.0, 26: 16.0, 27: 17.0, 28: 18.0, 29: 21.0, 30: 22.0, 31: 23.0, 32: 26.0, 33: 27.0, 34: 28.0, 35: 31.0, 36: 32.0, 37: 33.0, 38: 35.0, 39: 36.0, 40: 37.0}, 'yos_bins': {0: 0, 1: 0, 2: 0, 3: '0-1', 4: '0-1', 5: '0-1', 6: '1-2', 7: '1-2', 8: '1-2', 9: '1-2', 10: '2-3', 11: '2-3', 12: '2-3', 13: '3-4', 14: '3-4', 15: '3-4', 16: '4-5', 17: '4-5', 18: '4-5', 19: '4-5', 20: '5-6', 21: '5-6', 22: '5-6', 23: '10-15', 24: '10-15', 25: '10-15', 26: '15-20', 27: '15-20', 28: '15-20', 29: '20-40', 30: '20-40', 31: '20-40', 32: '20-40', 33: '20-40', 34: '20-40', 35: '20-40', 36: '20-40', 37: '20-40', 38: '20-40', 39: '20-40', 40: '20-40'}})
cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df_example['yos_bins'] = pd.cut(df_example['Years of service'], bins=cut_bins, labels=cut_labels)
print(df_example['yos_bins'].value_counts())
print(len(df_example['yos_bins']))
print(len(df_example))
print(df_example['yos_bins'].value_counts())
test = df_example[df_example['yos_bins'].isin(['0-1', '1-2', '2-3'])]
print('test dataframe:\n',test)
print('\n')
print('test value counts of yos_bins:\n', test['yos_bins'].value_counts())
print('\n')
dic_test = test.to_dict()
print(dic_test)
print('\n')
print(test.value_counts())
I have created bins for a column with "years of service":
cut_labels = ['0-1','1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)
df['yos_bins'] = pd.cut(df['Years of service'], bins=cut_bins, labels=cut_labels)
Then I applied .isin() to the dataframe column called 'yos_bins' with the intention to filter for a selection of column values. Excerpt from column in df.
The column I use to slice is called 'yos_bins' (i.e. binned Years of Service). I want to select only 3 ranges (0-1, 1-2, 2-3 years), but apparently there are more ranges included in the column.
To my surprise, when I apply value_counts(), I still get all values of the yos_bins column from the df dataframe (but with 0 counts).
test.yos_bins.value_counts()
Looks like this:
This was not intended, all other bins except the 3 in isin() should have been dropped. The resulting issue is that the 0 values are shown in sns.countplots, so I end up with undesired columns with zero counts.
When I save the df to_excel(), all "10-15" value fields show a "Text Date with 2-Digit Year" error. I do not load that dataframe back into python, so not sure if this could cause the problem?
Does anybody know how I can create the test dataframe that merely consists of the 3 yos_bins values instead of showing all yos_bins values, but some with zeros?
A somewhat ugly workaround, because in my experience numpy and pandas are awkward when it comes to element-wise "is in": I do the comparison manually with numpy arrays.
import numpy as np
yos_bins = np.array(df["yos_bins"])
yos_bins_sel = np.array(["0-1", "1-2", "2-3"])
mask = (yos_bins[:, None] == yos_bins_sel[None, :]).any(1)
df[mask]
Requesting as Years of service yos_bins
3 Employee 1.0 0-1
4 Ex-Employee 1.0 0-1
5 Employee 1.0 0-1
6 Employee 2.0 1-2
7 Employee 2.0 1-2
8 Ex-Employee 2.0 1-2
9 Ex-Employee 2.0 1-2
10 Employee 3.0 2-3
11 Employee 3.0 2-3
12 Ex-Employee 3.0 2-3
Explanation
(using x as yos_bins and y as yos_bins_sel)
(x[:, None] == y[None, :]).any(1) is the main takeaway. x[:, None] converts x from shape (n,) to (n, 1). y[None, :] converts y from shape (m,) to (1, m). Comparing them with == forms a broadcasted element-wise boolean array of shape (n, m). We want our array to be (n,)-shaped, so we apply .any(1), which compresses the second dimension to True if at least one of its booleans is True (that is, if the element is in the yos_bins_sel array). You end up with a boolean array which can be used to mask the original DataFrame. Replace x with the array containing the values to be compared and y with the array that the values of x should be contained in, and you can do this for any data set.
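A note on the zero counts themselves: pd.cut returns a categorical column, and a categorical keeps all of its declared categories after filtering, so value_counts() still reports the unused bins with a count of 0. The sketch below, which uses a made-up five-row frame rather than the asker's data, first checks that the broadcast mask above selects the same rows as .isin(), and then drops the unused categories with Series.cat.remove_unused_categories().

```python
import numpy as np
import pandas as pd

cut_labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-10', '10-15', '15-20', '20-40']
cut_bins = (0, 1, 2, 3, 4, 5, 6, 10, 15, 20, 40)

# Hypothetical miniature of the asker's data
df = pd.DataFrame({"Years of service": [1.0, 2.0, 3.0, 11.0, 21.0]})
df["yos_bins"] = pd.cut(df["Years of service"], bins=cut_bins, labels=cut_labels)

wanted = ["0-1", "1-2", "2-3"]

# The broadcast mask from the answer and .isin() select the same rows
x = np.array(df["yos_bins"])
y = np.array(wanted)
mask = (x[:, None] == y[None, :]).any(1)
same_rows = bool((mask == df["yos_bins"].isin(wanted).to_numpy()).all())

test = df[df["yos_bins"].isin(wanted)].copy()
before = test["yos_bins"].value_counts()   # still lists all ten bins, most with 0

# Drop the categories that no longer occur in the filtered frame
test["yos_bins"] = test["yos_bins"].cat.remove_unused_categories()
after = test["yos_bins"].value_counts()    # only the three selected bins remain
```

After this step a count plot of test["yos_bins"] should no longer show the empty bins, since they are no longer part of the column's categories.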

How to add lines with annotations to candlestick charts when some values are missing?

I'm trying to use Plotly to overlay a marker/line chart on top of my OHLC candle chart.
Code
import plotly.graph_objects as go
import pandas as pd
import numpy as np  # needed for the np.nan values below
from datetime import datetime
df = pd.DataFrame(
{'index': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23,
24: 24},
'Date': {0: '2018-09-03',
1: '2018-09-04',
2: '2018-09-05',
3: '2018-09-06',
4: '2018-09-07',
5: '2018-09-10',
6: '2018-09-11',
7: '2018-09-12',
8: '2018-09-13',
9: '2018-09-14',
10: '2018-09-17',
11: '2018-09-18',
12: '2018-09-19',
13: '2018-09-20',
14: '2018-09-21',
15: '2018-09-24',
16: '2018-09-25',
17: '2018-09-26',
18: '2018-09-27',
19: '2018-09-28',
20: '2018-10-01',
21: '2018-10-02',
22: '2018-10-03',
23: '2018-10-04',
24: '2018-10-05'},
'Open': {0: 1.2922067642211914,
1: 1.2867859601974487,
2: 1.2859420776367188,
3: 1.2914056777954102,
4: 1.2928247451782229,
5: 1.292808175086975,
6: 1.3027958869934082,
7: 1.3017443418502808,
8: 1.30451238155365,
9: 1.3110626935958862,
10: 1.3071041107177734,
11: 1.3146650791168213,
12: 1.3166556358337402,
13: 1.3140604496002195,
14: 1.3271400928497314,
15: 1.3080958127975464,
16: 1.3117163181304932,
17: 1.3180439472198486,
18: 1.3169677257537842,
19: 1.3077707290649414,
20: 1.3039510250091553,
21: 1.3043931722640991,
22: 1.2979763746261597,
23: 1.2941633462905884,
24: 1.3022021055221558},
'High': {0: 1.2934937477111816,
1: 1.2870012521743774,
2: 1.2979259490966797,
3: 1.2959914207458496,
4: 1.3024225234985352,
5: 1.3052103519439695,
6: 1.30804443359375,
7: 1.3044441938400269,
8: 1.3120088577270508,
9: 1.3143367767333984,
10: 1.3156682252883911,
11: 1.3171066045761108,
12: 1.3211784362792969,
13: 1.3296104669570925,
14: 1.3278449773788452,
15: 1.3166556358337402,
16: 1.3175750970840454,
17: 1.3196094036102295,
18: 1.3180439472198486,
19: 1.3090718984603882,
20: 1.3097577095031738,
21: 1.3049719333648682,
22: 1.3020155429840088,
23: 1.3036959171295166,
24: 1.310753345489502},
'Low': {0: 1.2856279611587524,
1: 1.2813942432403564,
2: 1.2793285846710205,
3: 1.289723515510559,
4: 1.2918561697006226,
5: 1.289823293685913,
6: 1.2976733446121216,
7: 1.298414707183838,
8: 1.3027619123458862,
9: 1.3073604106903076,
10: 1.3070186376571655,
11: 1.3120776414871216,
12: 1.3120431900024414,
13: 1.3140085935592651,
14: 1.305841088294983,
15: 1.3064552545547483,
16: 1.3097233772277832,
17: 1.3141123056411743,
18: 1.309706211090088,
19: 1.3002548217773438,
20: 1.3014055490493774,
21: 1.2944146394729614,
22: 1.2964619398117063,
23: 1.2924572229385376,
24: 1.3005592823028564},
'Close': {0: 1.292306900024414,
1: 1.2869019508361816,
2: 1.2858428955078125,
3: 1.2914891242980957,
4: 1.2925406694412231,
5: 1.2930254936218262,
6: 1.302643060684204,
7: 1.3015578985214231,
8: 1.304546356201172,
9: 1.311131477355957,
10: 1.307326316833496,
11: 1.3146305084228516,
12: 1.3168463706970217,
13: 1.3141123056411743,
14: 1.327087163925171,
15: 1.30804443359375,
16: 1.3117333650588991,
17: 1.3179919719696045,
18: 1.3172800540924072,
19: 1.3078734874725342,
20: 1.3039000034332275,
21: 1.3043591976165771,
22: 1.2981956005096436,
23: 1.294062852859497,
24: 1.3024225234985352},
'Pivot Price': {0: 1.2934937477111816,
1: np.nan,
2: 1.2793285846710205,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: 1.3296104669570925,
14: np.nan,
15: np.nan,
16: np.nan,
17: np.nan,
18: np.nan,
19: np.nan,
20: np.nan,
21: np.nan,
22: np.nan,
23: 1.2924572229385376,
24: np.nan}})
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['Open'],
high=df['High'],
low=df['Low'],
close=df['Close'])])
fig.add_trace(
go.Scatter(mode = "lines+markers",
x=df['Date'],
y=df["Pivot Price"]
))
fig.update_layout(
autosize=False,
width=1000,
height=800,)
fig.show()
This is the current image
This is the desired output/image
I want a black line between the markers (pivots). Ideally I would also like a value next to each line showing the distance between consecutive pivots, but I'm not sure how to do this.
For example, the distance between the first two pivots, round(abs(1.293494 - 1.279329),3), returns 0.014, so I would like this value next to the line.
The second is round(abs(1.279329 - 1.329610),3), so the value would be 0.05. I have hand-edited the image and added the lines for the first two values to give a visual representation of what I'm trying to achieve.
The problem seems to be the missing values. So just use pandas.Series.interpolate in combination with fig.add_annotation to get:
I've included annotations for differences as well. There are surely more elegant ways to do it than with for loops, but it does the job. Let me know if anything is unclear!
import pandas as pd
import numpy as np
import plotly.graph_objects as go
df = pd.DataFrame(
{'index': {0: 0,
1: 1,
2: 2,
3: 3,
4: 4,
5: 5,
6: 6,
7: 7,
8: 8,
9: 9,
10: 10,
11: 11,
12: 12,
13: 13,
14: 14,
15: 15,
16: 16,
17: 17,
18: 18,
19: 19,
20: 20,
21: 21,
22: 22,
23: 23,
24: 24},
'Date': {0: '2018-09-03',
1: '2018-09-04',
2: '2018-09-05',
3: '2018-09-06',
4: '2018-09-07',
5: '2018-09-10',
6: '2018-09-11',
7: '2018-09-12',
8: '2018-09-13',
9: '2018-09-14',
10: '2018-09-17',
11: '2018-09-18',
12: '2018-09-19',
13: '2018-09-20',
14: '2018-09-21',
15: '2018-09-24',
16: '2018-09-25',
17: '2018-09-26',
18: '2018-09-27',
19: '2018-09-28',
20: '2018-10-01',
21: '2018-10-02',
22: '2018-10-03',
23: '2018-10-04',
24: '2018-10-05'},
'Open': {0: 1.2922067642211914,
1: 1.2867859601974487,
2: 1.2859420776367188,
3: 1.2914056777954102,
4: 1.2928247451782229,
5: 1.292808175086975,
6: 1.3027958869934082,
7: 1.3017443418502808,
8: 1.30451238155365,
9: 1.3110626935958862,
10: 1.3071041107177734,
11: 1.3146650791168213,
12: 1.3166556358337402,
13: 1.3140604496002195,
14: 1.3271400928497314,
15: 1.3080958127975464,
16: 1.3117163181304932,
17: 1.3180439472198486,
18: 1.3169677257537842,
19: 1.3077707290649414,
20: 1.3039510250091553,
21: 1.3043931722640991,
22: 1.2979763746261597,
23: 1.2941633462905884,
24: 1.3022021055221558},
'High': {0: 1.2934937477111816,
1: 1.2870012521743774,
2: 1.2979259490966797,
3: 1.2959914207458496,
4: 1.3024225234985352,
5: 1.3052103519439695,
6: 1.30804443359375,
7: 1.3044441938400269,
8: 1.3120088577270508,
9: 1.3143367767333984,
10: 1.3156682252883911,
11: 1.3171066045761108,
12: 1.3211784362792969,
13: 1.3296104669570925,
14: 1.3278449773788452,
15: 1.3166556358337402,
16: 1.3175750970840454,
17: 1.3196094036102295,
18: 1.3180439472198486,
19: 1.3090718984603882,
20: 1.3097577095031738,
21: 1.3049719333648682,
22: 1.3020155429840088,
23: 1.3036959171295166,
24: 1.310753345489502},
'Low': {0: 1.2856279611587524,
1: 1.2813942432403564,
2: 1.2793285846710205,
3: 1.289723515510559,
4: 1.2918561697006226,
5: 1.289823293685913,
6: 1.2976733446121216,
7: 1.298414707183838,
8: 1.3027619123458862,
9: 1.3073604106903076,
10: 1.3070186376571655,
11: 1.3120776414871216,
12: 1.3120431900024414,
13: 1.3140085935592651,
14: 1.305841088294983,
15: 1.3064552545547483,
16: 1.3097233772277832,
17: 1.3141123056411743,
18: 1.309706211090088,
19: 1.3002548217773438,
20: 1.3014055490493774,
21: 1.2944146394729614,
22: 1.2964619398117063,
23: 1.2924572229385376,
24: 1.3005592823028564},
'Close': {0: 1.292306900024414,
1: 1.2869019508361816,
2: 1.2858428955078125,
3: 1.2914891242980957,
4: 1.2925406694412231,
5: 1.2930254936218262,
6: 1.302643060684204,
7: 1.3015578985214231,
8: 1.304546356201172,
9: 1.311131477355957,
10: 1.307326316833496,
11: 1.3146305084228516,
12: 1.3168463706970217,
13: 1.3141123056411743,
14: 1.327087163925171,
15: 1.30804443359375,
16: 1.3117333650588991,
17: 1.3179919719696045,
18: 1.3172800540924072,
19: 1.3078734874725342,
20: 1.3039000034332275,
21: 1.3043591976165771,
22: 1.2981956005096436,
23: 1.294062852859497,
24: 1.3024225234985352},
'Pivot Price': {0: 1.2934937477111816,
1: np.nan,
2: 1.2793285846710205,
3: np.nan,
4: np.nan,
5: np.nan,
6: np.nan,
7: np.nan,
8: np.nan,
9: np.nan,
10: np.nan,
11: np.nan,
12: np.nan,
13: 1.3296104669570925,
14: np.nan,
15: np.nan,
16: np.nan,
17: np.nan,
18: np.nan,
19: np.nan,
20: np.nan,
21: np.nan,
22: np.nan,
23: 1.2924572229385376,
24: np.nan}})
import plotly.graph_objects as go
import pandas as pd
from datetime import datetime
# df=pd.read_csv("for_so.csv")
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
# fig = go.Figure(data=[go.Candlestick(x=df.index,
open=df['Open'],
high=df['High'],
low=df['Low'],
close=df['Close'])])
# some calculations
df_diff = df['Pivot Price'].dropna().diff().copy()
df2 = df[df.index.isin(df_diff.index)].copy()
df2['Price Diff'] = df['Pivot Price'].dropna().values
fig.add_trace(
go.Scatter(mode = "lines+markers",
x=df['Date'],
y=df["Pivot Price"]
))
fig.update_layout(
autosize=False,
width=1000,
height=800,)
fig.add_trace(go.Scatter(x=df['Date'], y=df['Pivot Price'].interpolate(),
# fig.add_trace(go.Scatter(x=df.index, y=df['Pivot Price'].interpolate(),
mode = 'lines',
line = dict(color='black')))
def annot(value):
    # print(type(value))
    if np.isnan(value):
        return ''
    else:
        return value
j = 0
for i, p in enumerate(df['Pivot Price']):
    # print(p)
    # if not np.isnan(p) and not np.isnan(df_diff.iloc[j]):
    if not np.isnan(p):
        # print(not np.isnan(df_diff.iloc[j]))
        fig.add_annotation(dict(font=dict(color='rgba(0,0,200,0.8)',size=12),
                                x=df['Date'].iloc[i],
                                # x=df.index[i],
                                # x = xStart
                                y=p,
                                showarrow=False,
                                text=annot(round(abs(df_diff.iloc[j]),3)),
                                textangle=0,
                                xanchor='right',
                                xref="x",
                                yref="y"))
        j = j + 1
fig.update_xaxes(type='category')
fig.show()
The problem seems to be the missing values, which plotly has difficulty with. With this trick you draw the line only through the rows that actually have a value:
import plotly.graph_objects as go
import pandas as pd
from datetime import datetime
df=pd.read_csv("notebooks/for_so.csv")
# boolean mask of the rows that have a pivot price (df must be loaded first)
has_value = ~df["Pivot Price"].isna()
fig = go.Figure(data=[go.Candlestick(x=df['Date'],
open=df['Open'],
high=df['High'],
low=df['Low'],
close=df['Close'])])
fig.add_trace(
go.Scatter(mode = 'lines',
x=df[has_value]['Date'],
y=df[has_value]["Pivot Price"], line={'color':'black', 'width':1}
))
fig.add_trace(
go.Scatter(mode = "markers",
x=df['Date'],
y=df["Pivot Price"]
))
fig.update_layout(
autosize=False,
width=1000,
height=800,)
fig.show()
This did it for me.
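The distances the question asks for can also be computed with pandas alone, before any plotting. The sketch below uses a shortened, five-row version of the question's data; the column names "Distance" and "Label y" are made up for illustration, and placing the label halfway between two pivots is an assumption about where the text should go.

```python
import numpy as np
import pandas as pd

# Shortened version of the question's frame: four pivots and one gap
df = pd.DataFrame({
    "Date": ["2018-09-03", "2018-09-04", "2018-09-05", "2018-09-20", "2018-10-04"],
    "Pivot Price": [1.2934937477111816, np.nan, 1.2793285846710205,
                    1.3296104669570925, 1.2924572229385376],
})

# Keep only the rows that hold a pivot
pivots = df.dropna(subset=["Pivot Price"]).copy()

# Absolute distance to the previous pivot, rounded as in the question
pivots["Distance"] = pivots["Pivot Price"].diff().abs().round(3)

# Hypothetical label position: halfway between the two pivot prices
pivots["Label y"] = (pivots["Pivot Price"] + pivots["Pivot Price"].shift()) / 2

# The first pivot has no predecessor, so it gets no label
segments = pivots.dropna(subset=["Distance"])
```

Each row of segments then carries a date, a y position and a text value that could be passed to fig.add_annotation in a loop like the one in the first answer.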

Change size of scatterplot marker based on column value - Python 3.6.x

I have a dataset that looks like this:
{'ScoreDate': {0: '12/1/2019',
1: '1/1/2020',
2: '2/1/2020',
3: '3/1/2020',
4: '4/1/2020',
5: '5/1/2020',
6: '6/1/2020',
7: '7/1/2020',
8: '7/1/2020',
9: '7/1/2020',
10: '7/1/2020',
11: '7/1/2020',
12: '7/1/2020',
13: '8/1/2020',
14: '8/1/2020',
15: '8/1/2020',
16: '8/1/2020',
17: '8/1/2020',
18: '9/1/2020'},
'CustomerID': {0: 4554,
1: 4554,
2: 4554,
3: 4554,
4: 4554,
5: 4554,
6: 4554,
7: 4554,
8: 4554,
9: 4554,
10: 4554,
11: 4554,
12: 4554,
13: 4554,
14: 4554,
15: 4554,
16: 4554,
17: 4554,
18: 4554},
'Supplier_Name': {0: 'ABC Company',
1: 'ABC Company',
2: 'ABC Company',
3: 'ABC Company',
4: 'ABC Company',
5: 'ABC Company',
6: 'ABC Company',
7: 'ABC Company',
8: 'ABC Company',
9: 'ABC Company',
10: 'ABC Company',
11: 'ABC Company',
12: 'ABC Company',
13: 'ABC Company',
14: 'ABC Company',
15: 'ABC Company',
16: 'ABC Company',
17: 'ABC Company',
18: 'ABC Company'},
'Score': {0: 90,
1: 90,
2: 90,
3: 75,
4: 75,
5: 75,
6: 90,
7: 90,
8: 90,
9: 90,
10: 90,
11: 90,
12: 90,
13: 90,
14: 90,
15: 90,
16: 90,
17: 90,
18: 90},
'EDate': {0: nan,
1: nan,
2: nan,
3: nan,
4: '4/1/2020',
5: nan,
6: '6/1/2020',
7: '7/1/2020',
8: '7/1/2020',
9: '7/1/2020',
10: '7/1/2020',
11: '7/1/2020',
12: '7/1/2020',
13: '8/1/2020',
14: '8/1/2020',
15: '8/1/2020',
16: '8/1/2020',
17: '8/1/2020',
18: nan}}
And some code to produce a line plot of the Score with markers for each EDate:
size = 15
params = {'legend.fontsize': 'large',
'figure.figsize': (20,8),
'axes.labelsize': size,
'axes.titlesize': size,
'xtick.labelsize': size*0.75,
'ytick.labelsize': size*0.75,
'axes.titlepad': 25}
plt.figure(figsize=(10,5))
sns.set(style="darkgrid")
plt.rcParams.update(params)
sns.lineplot(data=df, x='ScoreDate', y='Score', ci=None,
linewidth=2, palette="deep").set(title="Score")
sns.scatterplot(data=df, x='EDate', y='Score', color='orange')
Which produces:
I am looking to accomplish:
Setting the marker size equal to how many EDates (events) occurred for that date
I have successfully grouped the data using:
c_df = df.groupby(['ScoreDate', 'Score'])['EDate'].count().reset_index(name='count')
size = 15
params = {'legend.fontsize': 'large',
'figure.figsize': (20,8),
'axes.labelsize': size,
'axes.titlesize': size,
'xtick.labelsize': size*0.75,
'ytick.labelsize': size*0.75,
'axes.titlepad': 25}
plt.figure(figsize=(10,5))
sns.set(style="darkgrid")
plt.rcParams.update(params)
sns.lineplot(data=c_df, x='ScoreDate', y='Score', ci=None,
linewidth=2, palette="deep").set(title="Score")
sns.scatterplot(data=c_df, x='ScoreDate', y='count', color='orange')
Which produces:
Which is clearly not what I am looking for. How can I accomplish my three objectives?
I believe you're looking for the size parameter:
sns.lineplot(data=df, x='ScoreDate', y='Score', ci=None,
linewidth=2, palette="deep").set(title="Score")
sns.scatterplot(data=c_df, x='ScoreDate', y='Score', size='count', color='orange')
Output:
Note: You can also pass the sizes parameter (e.g. sizes=[0,30,60,90]) to manually set the marker size for each count group. So, for example:
Notice that the marker sizes are now different (the zeros, for example, do not show at all). Alternatively, you can filter the zero counts out of c_df before plotting with c_df.query('count>0').
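For readers who want to see the counting and filtering steps in isolation, here is a small sketch with made-up data. It draws the markers with plain matplotlib instead of seaborn, and the factor of 60 for the marker area is an arbitrary assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical miniature of the question's data
df = pd.DataFrame({
    "ScoreDate": ["6/1/2020", "7/1/2020", "7/1/2020", "7/1/2020", "9/1/2020"],
    "Score": [90, 90, 90, 90, 90],
    "EDate": ["6/1/2020", "7/1/2020", "7/1/2020", "7/1/2020", None],
})

# count() ignores missing EDate values, so dates without events get 0
c_df = df.groupby(["ScoreDate", "Score"])["EDate"].count().reset_index(name="count")

# Keep only the dates on which at least one event occurred
events = c_df[c_df["count"] > 0]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(c_df["ScoreDate"], c_df["Score"], linewidth=2)
# Marker area grows with the number of events on that date
ax.scatter(events["ScoreDate"], events["Score"],
           s=events["count"] * 60, color="orange", zorder=3)
ax.set_title("Score")
```

The boolean mask used here selects the same rows as the c_df.query('count>0') call mentioned above.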

Nested for loops to create multiple pivot table based on 2 level multiindex in pandas

Started getting confused with this one. I have a large Fact Invoice Header table. I took the original dataframe, used a groupby to split the df up based upon one column. The output was a list of dataframes:
list_of_dfs = []
for _, g in df.groupby(df['Project State Name']):
    list_of_dfs.append(g)
list_of_dfs
Then I used another for loop to loop through the list of dataframes and perform one pivot table aggregation.
for each_state_df in list_of_dfs:
    columns_to_index_by = ['Project Issue', 'Project Secondary Issue', 'Project Client Name']
    # Aggregating to the Project Level
    table_for_pivots = pd.pivot_table(df, index=['FY Year', 'Project Issue'], values=["Project Key", 'Total Net Amount', "Project Total Resolution Amount", 'Project Budgeted Amount'],
        aggfunc= {"Project Key": lambda x: len(x.unique()), 'Total Net Amount': np.sum, "Project Total Resolution Amount": np.mean,
        'Project Budgeted Amount': np.mean},
        fill_value=np.mean)
    print(table_for_pivots)
My question is, how can I use another for loop to replace the second element in the pivot table index with each value in the variable columns_to_index_by? The output would be 3 pivot tables where index=['FY Year', 'Project Issue'], index=['FY Year', 'Project Secondary Issue'], and index=['FY Year', 'Project Client Name']. Thanks all!
Link to download a sample df data is here:
https://ufile.io/iufv9nma
Use list comprehension and iterate through a zip of the index you want to set for each group:
import numpy as np
import pandas as pd
from pandas import Timestamp
from numpy import nan
d = {'Total Net Amount': {2: 672.0, 41: 1277.9, 17: 270.0, 32: 845.3, 26: 828.62, 11: 733.5, 23: 1741.8, 35: 254.14655, 29: 245.0, 59: 215.0, 38: 617.4, 0: 1061.5}, 'Project Total Resolution Amount': {2: 35000, 41: 27000, 17: 40000, 32: 27000, 26: 27000, 11: 40000, 23: 27000, 35: 27000, 29: 27000, 59: 27000, 38: 27000, 0: 30000}, 'Invoice Header Key': {2: 1229422, 41: 984803, 17: 1270731, 32: 938069, 26: 911535, 11: 1247443, 23: 902150, 35: 943737, 29: 918888, 59: 1071541, 38: 965091, 0: 1279581}, 'Project Key': {2: 259661, 41: 194517, 17: 259188, 32: 194517, 26: 194517, 11: 259188, 23: 194517, 35: 194517, 29: 194517, 59: 194517, 38: 194517, 0: 263736}, 'Project Secondary Issue': {2: 2, 41: 4, 17: 0, 32: 3, 26: 3, 11: 0, 23: 4, 35: 4, 29: 4, 59: 4, 38: 3, 0: 4}, 'Organization Key': {2: 16029, 41: 22638, 17: 24230, 32: 22638, 26: 22638, 11: 24230, 23: 22638, 35: 22638, 29: 22638, 59: 22638, 38: 22638, 0: 4532}, 'Project Budgeted Amount': {2: 42735.0, 41: 32500.0, 17: 26000.0, 32: 32500.0, 26: 32500.0, 11: 26000.0, 23: 32500.0, 35: 32500.0, 29: 32500.0, 59: 32500.0, 38: 32500.0, 0: nan}, 'Project State Name': {2: 0, 41: 1, 17: 2, 32: 1, 26: 1, 11: 2, 23: 1, 35: 1, 29: 1, 59: 1, 38: 1, 0: 1}, 'Project Issue': {2: 0, 41: 2, 17: 1, 32: 2, 26: 2, 11: 1, 23: 2, 35: 2, 29: 2, 59: 2, 38: 2, 0: 1}, 'Project Number': {2: 2, 41: 0, 17: 1, 32: 0, 26: 0, 11: 1, 23: 0, 35: 0, 29: 0, 59: 0, 38: 0, 0: 3}, 'Project Client Name': {2: 1, 41: 0, 17: 0, 32: 0, 26: 0, 11: 0, 23: 0, 35: 0, 29: 0, 59: 0, 38: 0, 0: 1}, 'Paid Date Year Month': {2: 13, 41: 7, 17: 15, 32: 4, 26: 2, 11: 14, 23: 1, 35: 5, 29: 3, 59: 12, 38: 6, 0: 16}, 'FY Year': {2: 2, 41: 0, 17: 2, 32: 0, 26: 0, 11: 2, 23: 0, 35: 0, 29: 0, 59: 1, 38: 0, 0: 2}, 'Invoice Paid Date': {2: Timestamp('2019-09-10 00:00:00'), 41: Timestamp('2017-12-20 00:00:00'), 17: Timestamp('2019-11-25 00:00:00'), 32: Timestamp('2017-08-31 00:00:00'), 26: Timestamp('2017-06-14 00:00:00'), 11: Timestamp('2019-10-08 00:00:00'), 23: 
Timestamp('2017-05-30 00:00:00'), 35: Timestamp('2017-09-07 00:00:00'), 29: Timestamp('2017-07-10 00:00:00'), 59: Timestamp('2018-10-03 00:00:00'), 38: Timestamp('2017-11-03 00:00:00'), 0: Timestamp('2019-12-12 00:00:00')}, 'Invoice Paid Date Key': {2: 20190910, 41: 20171220, 17: 20191125, 32: 20170831, 26: 20170614, 11: 20191008, 23: 20170530, 35: 20170907, 29: 20170710, 59: 20181003, 38: 20171103, 0: 20191212}, 'Count Project Secondary Issue': {2: 3, 41: 3, 17: 3, 32: 3, 26: 3, 11: 3, 23: 3, 35: 3, 29: 3, 59: 3, 38: 3, 0: 2}, 'Total Net Amount By Count Project Secondary Issue': {2: 224.0, 41: 425.9666666666667, 17: 90.0, 32: 281.7666666666667, 26: 276.2066666666666, 11: 244.5, 23: 580.6, 35: 84.71551666666666, 29: 81.66666666666667, 59: 71.66666666666667, 38: 205.8, 0: 530.75}, 'Total Net Invoice Amount': {2: 672.0, 41: 1277.9, 17: 270.0, 32: 845.3, 26: 828.62, 11: 733.5, 23: 1741.8, 35: 254.14655, 29: 245.0, 59: 215.0, 38: 617.4, 0: 1061.5}, 'Total Project Invoice Amount': {2: 7176.52, 41: 10110.98655, 17: 1678.5, 32: 10110.98655, 26: 10110.98655, 11: 1678.5, 23: 10110.98655, 35: 10110.98655, 29: 10110.98655, 59: 10110.98655, 38: 10110.98655, 0: 1061.5}, 'Invoice Dollar Percent of Project': {2: 0.09363869953682286, 41: 0.1263872712796755, 17: 0.160857908847185, 32: 0.08360212881501655, 26: 0.08195243816242638, 11: 0.4369973190348526, 23: 0.1722680562758735, 35: 0.02513568272919916, 29: 0.02423106773888449, 59: 0.02126399821983741, 38: 0.06106229070198891, 0: 1.0}}
df = pd.DataFrame(d)
# list comprehension with groupby
group = [g for _, g in df.groupby('Project State Name')]
#create a list of indices you want to use in pivot
idx = [['FY Year', 'Project Issue'],
['FY Year', 'Project Secondary Issue'],
['FY Year', 'Project Client Name']]
# create a list of columns to add to the value param in pivot
values = ["Project Key", 'Total Net Amount',
"Project Total Resolution Amount", 'Project Budgeted Amount']
# use your current pivot and iterate through zip(idx, group)
dfs = [pd.pivot_table(df, index=i, values=values,
aggfunc= {"Project Key": lambda x: len(x.unique()), 'Total Net Amount': np.sum,
"Project Total Resolution Amount": np.mean,
'Project Budgeted Amount': np.mean},
fill_value=np.mean) for i,df in zip(idx, group)]
dict comprehension
I did not know what you wanted the key to be so I just selected the second value from idx. You will call each dataframe from the dict by dfs['Project Issue']
dfs = {i[1]: pd.pivot_table(df, index=i, values=values,
aggfunc= {"Project Key": lambda x: len(x.unique()), 'Total Net Amount': np.sum,
"Project Total Resolution Amount": np.mean,
'Project Budgeted Amount': np.mean},
fill_value=np.mean) for i,df in zip(idx, group)}
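Note that zip(idx, group) pairs the first index with the first state, the second index with the second state, and so on. If the goal is instead every index combination for every state, a nested loop is one way to get there. The following is a sketch on a made-up four-row frame; keying the result dict by (state, column) and using the string aggregations "nunique" and "sum" are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Hypothetical miniature of the invoice table
df = pd.DataFrame({
    "Project State Name": ["TX", "TX", "CA", "CA"],
    "FY Year": [2019, 2019, 2019, 2020],
    "Project Issue": ["A", "B", "A", "A"],
    "Project Secondary Issue": ["x", "x", "y", "y"],
    "Project Client Name": ["c1", "c2", "c1", "c1"],
    "Project Key": [1, 2, 3, 3],
    "Total Net Amount": [100.0, 50.0, 20.0, 30.0],
})

columns_to_index_by = ["Project Issue", "Project Secondary Issue", "Project Client Name"]

pivots = {}
# Outer loop: one sub-frame per state
for state, state_df in df.groupby("Project State Name"):
    # Inner loop: swap the second index level
    for col in columns_to_index_by:
        pivots[(state, col)] = pd.pivot_table(
            state_df,                      # pivot the state's rows, not the full df
            index=["FY Year", col],
            values=["Project Key", "Total Net Amount"],
            aggfunc={"Project Key": "nunique", "Total Net Amount": "sum"},
        )
```

A single table is then looked up with, for example, pivots[("TX", "Project Issue")].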
