Multiple pie charts from pandas dataframe - python

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'REC2': {0: '18-24',
1: '18-24',
2: '25-34',
3: '25-34',
4: '35-44',
5: '35-44',
6: '45-54',
7: '45-54',
8: '55-64',
9: '55-64',
10: '65+',
11: '65+'},
'Q8_1': {0: 'No',
1: 'Yes',
2: 'No',
3: 'Yes',
4: 'No',
5: 'Yes',
6: 'No',
7: 'Yes',
8: 'No',
9: 'Yes',
10: 'No',
11: 'Yes'},
'val': {0: 0.9642857142857143,
1: 0.03571428571428571,
2: 0.8208955223880597,
3: 0.1791044776119403,
4: 0.8507462686567164,
5: 0.14925373134328357,
6: 0.8484848484848485,
7: 0.15151515151515152,
8: 0.8653846153846154,
9: 0.1346153846153846,
10: 0.9375,
11: 0.0625}})
which looks like this:
     REC2 Q8_1       val
0   18-24   No  0.964286
1   18-24  Yes  0.035714
2   25-34   No  0.820896
3   25-34  Yes  0.179104
4   35-44   No  0.850746
5   35-44  Yes  0.149254
6   45-54   No  0.848485
7   45-54  Yes  0.151515
8   55-64   No  0.865385
9   55-64  Yes  0.134615
10    65+   No  0.937500
11    65+  Yes  0.062500
I am trying to create a separate pie chart for each age bin. Currently I am using a hardcoded version, where I need to type in all the available bins. However, I am looking for a solution that does this within a loop or automatically assigns the correct bins. This is my current solution:
import matplotlib.pyplot as plt
from matplotlib import rcParams

df = df.pivot_table(values="val", index=["REC2", "Q8_1"])  # the frame defined above (originally referenced as data)
rcParams['figure.figsize'] = (6, 10)
f, a = plt.subplots(3, 2)
df.xs('18-24').plot(kind='pie', ax=a[0, 0], y="val")
df.xs('25-34').plot(kind='pie', ax=a[1, 0], y="val")
df.xs('35-44').plot(kind='pie', ax=a[2, 0], y="val")
df.xs('45-54').plot(kind='pie', ax=a[0, 1], y="val")
df.xs('55-64').plot(kind='pie', ax=a[1, 1], y="val")
df.xs('65+').plot(kind='pie', ax=a[2, 1], y="val")
Output:

I think you want:
df.groupby('REC2').plot.pie(x='Q8_1', y='val', layout=(2,3))
Update: I took a look, and it turns out that groupby.plot does a different thing. So you can try a for loop:
df = df.set_index("Q8_1")
f, a = plt.subplots(3, 2)
for age, ax in zip(sorted(df.REC2.unique()), a.ravel()):  # sorted unique bins keep the panel order deterministic (set() is unordered)
    df[df.REC2.eq(age)].plot.pie(y='val', ax=ax)
plt.show()
which yields:
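For completeness, the loop can be avoided entirely by pivoting to wide form and letting pandas lay out the subplots. A minimal sketch, assuming the original long-form frame (columns REC2, Q8_1, val) is still available as df:
import matplotlib.pyplot as plt

# One column per age bin; each column sums to 1, so each makes a valid pie
wide = df.pivot(index='Q8_1', columns='REC2', values='val')
wide.plot.pie(subplots=True, layout=(3, 2), figsize=(6, 10), legend=False)
plt.show()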

Related

How to Fix Code to Avoid Stubnames Error (Python Pandas)?

dt = {'id': {0: 'x1', 1: 'x2', 2: 'x3', 3: 'x4', 4: 'x5', 5: 'x6', 6: 'x7', 7: 'x8', 8: 'x9', 9: 'x10'},
      'trt': {0: 'cnt', 1: 'cnt', 2: 'tr', 3: 'tr', 4: 'tr', 5: 'cnt', 6: 'tr', 7: 'tr', 8: 'cnt', 9: 'cnt'},
      'work.T1': {0: 0.6516556669957936, 1: 0.567737752571702, 2: 0.1135089821182191, 3: 0.5959253052715212, 4: 0.3580499750096351,
                  5: 0.4288094183430075, 6: 0.0519033221062272, 7: 0.2641776674427092, 8: 0.3987907308619469, 9: 0.8361341434065253},
      'play.T1': {0: 0.8647212258074433, 1: 0.6153524168767035, 2: 0.7751098964363337, 3: 0.3555686913896352, 4: 0.4058499720413238,
                  5: 0.7066469138953835, 6: 0.8382876652758569, 7: 0.2395891312044114, 8: 0.7707715332508087, 9: 0.3558977444190532},
      'talk.T1': {0: 0.5355970377568156, 1: 0.0930881295353174, 2: 0.169803041499108, 3: 0.8998324507847428, 4: 0.4226376069709658,
                  5: 0.7477464678231627, 6: 0.8226525799836963, 7: 0.9546536463312804, 8: 0.6854445093777031, 9: 0.5005032296758145},
      'work.T2': {0: 0.2754838624969125, 1: 0.2289039448369294, 2: 0.0144339059479534, 3: 0.7289645625278354, 4: 0.2498804717324674,
                  5: 0.1611832766793668, 6: 0.0170426501426845, 7: 0.4861003451514989, 8: 0.1029001718852669, 9: 0.8015470046084374},
      'play.T2': {0: 0.3543280649464577, 1: 0.9364325392525644, 2: 0.2458663922734558, 3: 0.4731414613779634, 4: 0.191560871200636,
                  5: 0.5832219698932022, 6: 0.4594731898978352, 7: 0.467434047954157, 8: 0.3998325555585325, 9: 0.5052855962421745},
      'talk.T2': {0: 0.0318881559651345, 1: 0.1144675880204886, 2: 0.468935475917533, 3: 0.3969867376144975, 4: 0.8336191941052675,
                  5: 0.7611217433586717, 6: 0.5733564489055425, 7: 0.447508045937866, 8: 0.0838020080700516, 9: 0.2191385473124683}}
mydt = pd.DataFrame(dt, columns = ['id', 'trt', 'work.T1', '', 'play.T1', 'talk.T1','work.T2', '', 'play.T2', 'talk.T2'])
So I have the above dataset and need to tidy it up. I have used the following code but it returns "ValueError: stubname can't be identical to a column name." How can I fix the code to avoid this problem?
names = ['play', 'talk', 'work']
activities = pd.wide_to_long(dt, stubnames=names, i='id', j='time', sep='.', suffix='T\d').sort_index().reset_index()
activities
Note: I am trying to get the dataframe into a long format like the following (one row per id and time, with columns trt, play, talk, and work).
Changed:
activities = pd.wide_to_long(dt, stubnames=names, i='id', j='time', sep='.', suffix='T\d').sort_index().reset_index()
to:
activities = pd.wide_to_long(mydt, stubnames=names, i='id', j='time', sep='.', suffix='T\d').sort_index().reset_index()
and then it works.
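For reference, a self-contained version of the fix. This is a sketch: it builds the frame directly from dt (skipping the stray empty-string entries in the columns list) and uses a raw string for the suffix regex to avoid an invalid-escape warning:
import pandas as pd

mydt = pd.DataFrame(dt)  # dt as defined in the question
names = ['play', 'talk', 'work']
activities = (pd.wide_to_long(mydt, stubnames=names, i='id', j='time',
                              sep='.', suffix=r'T\d')
              .sort_index()
              .reset_index())
print(activities.head())  # columns: id, time, trt, play, talk, work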

How can I use get_valid_primitives when I have only one dataframe in Featuretools?

I am trying to figure out how Featuretools works and I am testing it on the Housing Prices dataset on Kaggle. Because the dataset is huge, I'll work here with only a subset of it.
The dataframe is:
train={'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60}, 'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'}, 'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0}, 'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}}
I create an EntitySet for this dataframe:
es_train = ft.EntitySet()
I add the dataframe to the created EntitySet:
es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')
Then I call the function:
ap, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')
And here it all breaks up, because I get the following error message:
KeyError: 'DataFrame train does not exist in entity set'
I tried to study the tutorials on the Featuretools site, but all I could find are tutorials with multiple dataframes, so it didn't help me at all.
Where am I going wrong? How can I correct the mistake(s)?
Thanks!
Later edit: I am using PyCharm. When I work in script mode, I get the error above. However, when I use the command line, everything works perfectly.
The only issue I see with your code is that you're not wrapping your train object with pd.DataFrame.
This code works well for me:
import featuretools as ft
import pandas as pd
train = pd.DataFrame({
    'Id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
    'MSSubClass': {0: 60, 1: 20, 2: 60, 3: 70, 4: 60},
    'MSZoning': {0: 'RL', 1: 'RL', 2: 'RL', 3: 'RL', 4: 'RL'},
    'LotFrontage': {0: 65.0, 1: 80.0, 2: 68.0, 3: 60.0, 4: 84.0},
    'LotArea': {0: 8450, 1: 9600, 2: 11250, 3: 9550, 4: 14260}
})
es_train = ft.EntitySet()
es_train.add_dataframe(dataframe_name='train', dataframe=train, index='Id')
_, tp = ft.get_valid_primitives(entityset=es_train, target_dataframe_name='train')
for p in tp:
    print(p.name)
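Once the EntitySet is built this way, the same objects feed straight into ft.dfs if you want to go on and generate features. A minimal sketch; the trans_primitives choice here is just an illustration, any valid transform primitive works:
# Build a feature matrix from the single-dataframe EntitySet
feature_matrix, feature_defs = ft.dfs(
    entityset=es_train,
    target_dataframe_name='train',
    trans_primitives=['absolute'],  # hypothetical pick for demonstration
)
print(feature_matrix.head())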

Join pandas dataframes according to common index value only

I have the following dataframes (this is just test data). In real samples, I have index values that are repeated a few times inside DataFrame 1 and DataFrame 2, which causes the repeated/duplicate rows inside the final dataframe.
DataFrame 1:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0#google.de',
1: 'ddunsire1#geocities.com',
2: 'cwickrath2#github.com',
3: 'pjopp3#infoseek.co.jp',
4: 'jgheeraert4#theatlantic.com',
5: 'jgawith5#sciencedirect.com',
6: 'lfarrow6#wikimedia.org',
7: 'clegging7#businessinsider.com',
8: 'bbeckwith8#zdnet.com',
9: 'cburgoin9#reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False}})
DataFrame 2:
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
The two dataframes should be connected based only on index values found in both dataframes. That means any index value that doesn't match between the two dataframes should not appear in the final combined/merged dataframe.
I want to ensure that the final dataframe is unique and only captures column combinations based on unique index values.
When I try using the following code, the output is supposed to 'inner join' based on the unique index found in both dataframes.
final = pd.merge(df1, df2, left_index=True, right_index=True)
However, when I apply the above merge technique to my larger (other) pandas dataframes, many rows are repeated/duplicated multiple times. When the merging happens a few times with more dataframes, the rows get repeated very frequently, with the same index value.
I am expecting to see one Index value returned per row (with all the column combinations from each dataframe).
I am not sure why this happens. I can confirm that there is nothing wrong with the datasets.
Is there a better technique for merging those two dataframes based only on common index values, while also ensuring that I don't repeat any rows (with the same index) in my final dataframe? This merging often creates a giant final CSV file, around 20GB in size, even though the source files are only around 15MB in total.
Any help is much appreciated.
My end output should look like this (please copy and use this as Pandas DF):
pd.DataFrame({'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'first_name': {0: 'Jennee',
1: 'Dagny',
2: 'Correy',
3: 'Pall',
4: 'Julie',
5: 'Janene',
6: 'Lemmy',
7: 'Coleman',
8: 'Beck',
9: 'Che'},
'last_name': {0: 'Strelitzki',
1: 'Dunsire',
2: 'Wickrath',
3: 'Jopp',
4: 'Gheeraert',
5: 'Gawith',
6: 'Farrow',
7: 'Legging',
8: 'Beckwith',
9: 'Burgoin'},
'email': {0: 'jstrelitzki0#google.de',
1: 'ddunsire1#geocities.com',
2: 'cwickrath2#github.com',
3: 'pjopp3#infoseek.co.jp',
4: 'jgheeraert4#theatlantic.com',
5: 'jgawith5#sciencedirect.com',
6: 'lfarrow6#wikimedia.org',
7: 'clegging7#businessinsider.com',
8: 'bbeckwith8#zdnet.com',
9: 'cburgoin9#reference.com'},
'gender': {0: 'Male',
1: 'Female',
2: 'Female',
3: 'Female',
4: 'Female',
5: 'Female',
6: 'Male',
7: 'Female',
8: 'Polygender',
9: 'Male'},
'ip_address': {0: '8.99.68.120',
1: '188.238.129.48',
2: '87.159.243.249',
3: '66.37.174.94',
4: '233.77.128.104',
5: '190.202.131.98',
6: '84.175.231.196',
7: '140.178.100.5',
8: '81.211.179.167',
9: '31.219.69.206'},
'Boolean': {0: False,
1: False,
2: True,
3: True,
4: False,
5: True,
6: True,
7: False,
8: False,
9: False},
'Model': {0: 2005,
1: 2007,
2: 2011,
3: 2003,
4: 1998,
5: 1992,
6: 1992,
7: 1992,
8: 2008,
9: 1996},
'Make': {0: 'Cadillac',
1: 'Lexus',
2: 'Dodge',
3: 'Dodge',
4: 'Oldsmobile',
5: 'Volkswagen',
6: 'Chevrolet',
7: 'Suzuki',
8: 'Ford',
9: 'Mazda'},
'Colour': {0: 'Red',
1: 'Red',
2: 'Crimson',
3: 'Red',
4: 'Purple',
5: 'Crimson',
6: 'Red',
7: 'Aquamarine',
8: 'Puce',
9: 'Maroon'}})
This is expected behavior with non-unique index values. Since you have 3 ID1 rows in one df and 2 ID1 rows in the other, you end up with 6 ID1 rows in your merged df. If you add validate="one_to_one" to pd.merge(), you will get this error:
MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
All other validations fail except for many-to-many.
If it makes sense for your data, you can use the left_on, and right_on parameters to find unique combinations and give you a one-to-one if that's what you're after.
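To see the row multiplication concretely, here is a toy sketch (hypothetical frames, not the data above): three 'A' rows merged with two 'A' rows yield 3 x 2 = 6 rows.
import pandas as pd

left = pd.DataFrame({'id': ['A', 'A', 'A'], 'x': [1, 2, 3]})
right = pd.DataFrame({'id': ['A', 'A'], 'y': [10, 20]})

merged = pd.merge(left, right, on='id')  # many-to-many join on a non-unique key
print(len(merged))  # 6: every left 'A' row pairs with every right 'A' row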
Edit after your new data:
Now that you have unique ids, this should work for you. Notice it doesn't throw a validation error.
final = pd.merge(df1, df2, left_on=['id'], right_on=['id'], validate='one_to_one')
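If the repeated keys in your larger data are genuine duplicates, another option is to de-duplicate before merging so the one-to-one validation passes. A sketch, assuming keeping the first occurrence per id is acceptable for your data:
final = pd.merge(df1.drop_duplicates(subset='id'),
                 df2.drop_duplicates(subset='id'),
                 on='id', validate='one_to_one')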

How to specify legend based on different groups with Matplotlib or Seaborn

I have a dataset that looks like the following:
df = {'tic': {0: 'A',
1: 'AAPL',
2: 'ABC',
3: 'ABT',
4: 'ADBE',
5: 'ADI',
6: 'ADM',
7: 'ADP',
8: 'ADSK',
9: 'AEE'},
'Class': {0: 'Manufacturing',
1: 'Tech',
2: 'Trade',
3: 'Manufacturing',
4: 'Services',
5: 'Tech',
6: 'Manufacturing',
7: 'Services',
8: 'Services',
9: 'Electricity and Transportation'},
'Color': {0: 'blue',
1: 'teal',
2: 'purple',
3: 'blue',
4: 'red',
5: 'teal',
6: 'blue',
7: 'red',
8: 'red',
9: 'orange'},
'Pooled 1': {0: 0.0643791550056838,
1: 0.05022103288830682,
2: 0.039223739393748916,
3: 0.036366693834970217,
4: 0.05772708899447428,
5: 0.05969899935101172,
6: 0.04568101605219955,
7: 0.04542272002937567,
8: 0.07138013872431757,
9: 0.029987722053015278}}
I want to produce a bar plot with the values stored in Pooled 1, but I would like to color the bars with the colors stored in Color. All bars of the same Class should have the same color and should be plotted together. I am only showing part of the dataset above.
The code I am using is the following:
fig, axs = plt.subplots(1,1,figsize = (24, 5))
tmp_df = df.sort_values('Class')
plt.bar(np.arange(len(df)), tmp_df['Pooled 1'], color = tmp_df['Color'])
It produces almost the desired output:
I would like to have a legend with the names in Class and the colors from Color. I know that seaborn can do that with barplot but then it won't follow the desired colors. And I don't know why but barplot takes a long time to plot the dataset. Matplotlib is super quick though.
What's the best way to add a legend in this case? Thanks in advance!
You can assign a label to the first bar of each class. Matplotlib will use these labels to create a legend:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
df = pd.DataFrame({'tic': {0: 'A', 1: 'AAPL', 2: 'ABC', 3: 'ABT', 4: 'ADBE', 5: 'ADI', 6: 'ADM', 7: 'ADP', 8: 'ADSK', 9: 'AEE'}, 'Class': {0: 'Manufacturing', 1: 'Tech', 2: 'Trade', 3: 'Manufacturing', 4: 'Services', 5: 'Tech', 6: 'Manufacturing', 7: 'Services', 8: 'Services', 9: 'Electricity and Transportation'}, 'Color': {0: 'blue', 1: 'teal', 2: 'purple', 3: 'blue', 4: 'red', 5: 'teal', 6: 'blue', 7: 'red', 8: 'red', 9: 'orange'}, 'Pooled 1': {0: 0.0643791550056838, 1: 0.05022103288830682, 2: 0.039223739393748916, 3: 0.036366693834970217, 4: 0.05772708899447428, 5: 0.05969899935101172, 6: 0.04568101605219955, 7: 0.04542272002937567, 8: 0.07138013872431757, 9: 0.029987722053015278}})
fig, ax = plt.subplots(1, 1, figsize=(14, 5))
tmp_df = df.sort_values('Class')
bars = ax.bar(tmp_df['tic'], tmp_df['Pooled 1'], color=tmp_df['Color'])
prev = None
for cl, color, bar in zip(tmp_df['Class'], tmp_df['Color'], bars):
    if cl != prev:
        bar.set_label(cl)
    prev = cl
ax.margins(x=0.01)
ax.legend(title='Class', bbox_to_anchor=(1.01, 1.01), loc='upper left')
plt.tight_layout()
plt.show()
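An equivalent way to build the same legend is with proxy artists, one Patch per class, which decouples the legend entries from the bars. A sketch reusing tmp_df and ax from above:
from matplotlib.patches import Patch

# dict(zip(...)) keeps one color per class (duplicates map to the same value)
class_colors = dict(zip(tmp_df['Class'], tmp_df['Color']))
handles = [Patch(color=c, label=cl) for cl, c in class_colors.items()]
ax.legend(handles=handles, title='Class', bbox_to_anchor=(1.01, 1.01), loc='upper left')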
PS: Note that you could also use Seaborn and let the coloring go automatic:
import seaborn as sns
sns.barplot(data=tmp_df, x='tic', y='Pooled 1', hue='Class', palette='tab10', dodge=False, saturation=1, ax=ax)

Python - How can I plot a line graph properly with a dictionary?

I am trying to plot a line graph to show the trends of each key of a dictionary in Jupyter Notebook with Python. This is what I have in the k_rmse_values variable as shown below:
k_rmse_values =
{'bore': {1: 8423.759328233446,
3: 6501.928933614838,
5: 6807.187615513473,
7: 6900.29659028346,
9: 7134.8868708101645},
'city-mpg': {1: 4265.365592771621,
3: 3865.0178306330113,
5: 3720.409335758634,
7: 3819.183283405616,
9: 4219.677972675927},
'compression-rate': {1: 7016.906657495168,
3: 7319.354017489066,
5: 6301.624922763969,
7: 6133.006310754547,
9: 6417.253959732598},
'curb-weight': {1: 3950.9888180049306,
3: 4201.343428000144,
5: 4047.052502155118,
7: 3842.0974736649846,
9: 3943.9478256384205},
'engine-size': {1: 2853.7338453331627,
3: 2793.6254775629623,
5: 3123.320055069605,
7: 2941.73029681235,
9: 2931.996240628853},
'height': {1: 6330.178232877807,
3: 7049.500497198366,
5: 6869.570862695864,
7: 6738.641089739572,
9: 6344.062937760911},
'highway-mpg': {1: 4826.0580187146525,
3: 3510.253629329685,
5: 3379.2250123364083,
7: 4044.271135312068,
9: 4462.027046251678},
'horsepower': {1: 3623.6389886411143,
3: 4294.825669466819,
5: 4778.254807521257,
7: 4730.538701514935,
9: 4662.8601512508885},
'length': {1: 4952.798701744297,
3: 5403.624431188139,
5: 5500.731909846179,
7: 5103.4515274528885,
9: 4471.077661709427},
'normalized-losses': {1: 9604.929081466453,
3: 7494.820436511842,
5: 6391.912634697067,
7: 6699.853883298577,
9: 6861.6389834002875},
'peak-rpm': {1: 8041.2366213164005,
3: 7502.080095843049,
5: 6521.863037752326,
7: 6869.602542315512,
9: 6884.533017667794},
'stroke': {1: 10330.231237489314,
3: 8947.585146097614,
5: 6973.912792744113,
7: 7266.333478250421,
9: 7026.017456146411},
'wheel-base': {1: 2797.4144312203725,
3: 3392.8627620671928,
5: 4238.25624378706,
7: 4456.687059524217,
9: 4426.032222634904},
'width': {1: 2849.2691940215127,
3: 4076.59327053035,
5: 3979.9751617315405,
7: 3845.3326184519606,
9: 3687.926625900343}}
I used this code to plot:
for k, v in k_rmse_values.items():
    x = list(v.keys())
    y = list(v.values())
    plt.plot(x, y)
plt.xlabel('k value')
plt.ylabel('RMSE')
However, it doesn't plot from 1 to 9 in order; it gives this graph:
It plots in this k-value order: 1, 3, 9, 5, 7.
I have spent hours on this problem and still can't figure out a way to do it. Your help with this would be greatly appreciated.
One solution is to sort the keys and get the matching values:
for k, v in k_rmse_values.items():
    xs = sorted(v.keys())  # sorted() returns a new list; list.sort() sorts in place and returns None
    ys = [v[x] for x in xs]
    plt.plot(xs, ys)
# Note I renamed these arrays, so following uses should be changed accordingly
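To tell the fourteen lines apart, you can also label each one with its key and add a legend. A minimal sketch, assuming k_rmse_values as defined above:
import matplotlib.pyplot as plt

for k, v in k_rmse_values.items():
    xs = sorted(v.keys())
    plt.plot(xs, [v[x] for x in xs], label=k)

plt.xlabel('k value')
plt.ylabel('RMSE')
plt.legend(fontsize=8, ncol=2)
plt.show()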
