.groupby function in Python

I am trying to create a pie chart in Python. I have a dataset with 2137 responses to a question, with answer choices ranging from 1 to 5. I want a pie chart with the percentage of responses for each answer choice, but when I run my code, it produces a pie chart with one slice per respondent (so 2137 pieces of the pie). I think I need to use the .groupby function, but I am not entirely sure how to use it correctly.
Here is what I have tried (PS I am just starting to learn Python, so sorry if this is a dumb question!!):
df3 = pd.DataFrame(df, columns=['Q78']).groupby(['Q78'])
df3.plot.pie(subplots=True)

One possible solution:
s = df.Q78
s.groupby(s).size().plot.pie(autopct='%1.1f%%');
To test my code I created a DataFrame limited to just 8 answers:
Q78
0 A
1 B
2 C
3 D
4 E
5 A
6 B
7 A
and I got the following picture:
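For reference, the same group-and-count can be done in one step with value_counts (a minimal sketch using the question's Q78 column; the plotting line is commented out so the snippet runs headless):

```python
import pandas as pd

# Sample responses mirroring the 8-answer test frame above
df = pd.DataFrame({'Q78': list('ABCDEABA')})

# value_counts() groups identical answers and counts them in one step
counts = df['Q78'].value_counts()

# counts.plot.pie(autopct='%1.1f%%')  # one slice per answer choice
```

Passing normalize=True to value_counts returns fractions directly, if you want the percentages as data rather than only as pie labels.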

Related

How to create year-over-year plots in python

I have the following df
player season pts
A 2017 6
A 2018 5
A 2019 9
B 2017 2
B 2018 1
B 2019 3
C 2017 10
C 2018 8
C 2019 7
I would like to make a plot to look at the stability of pts year over year. That is, I want to see how correlated pts are on a year-to-year basis. I have tried various ways to plot this, but can't seem to get it quite right. Here is what I tried initially:
fig, ax = plt.subplots(figsize=(15,10))
for i in df.season:
    sns.scatterplot(df.pts.iloc[i], df.pts.iloc[i]+1)
plt.xlabel('WOPR Year n')
plt.ylabel('WOPR Year n+1')
IndexError: single positional indexer is out-of-bounds
I thought about it some more, and thought something like this may work:
fig, ax = plt.subplots(figsize=(15,10))
seasons = [2017,2018,2019]
for i in seasons:
    sns.scatterplot(df.pts.loc[df.season==i], df.pts.loc[df.season==i+1])
plt.xlabel('WOPR Year n')
plt.ylabel('WOPR Year n+1')
This didn't return an error, but just gave me a blank plot. I think I am close here. Any help is appreciated. Thanks! To clarify, I want each player to be plotted twice: once for x=2017 and y=2018, and again for x=2018 and y=2019 (hence the year n+1). EDIT: sns.regplot() would probably be better than sns.scatterplot here, as I could adjust the trendline to my liking. The image below captures the stability of the desired metric from year to year.
I think you can do a self-merge:
sns.lineplot(data=df.merge(df.assign(season=df.season+1),
on=['player','season'],
suffixes=['_last','_current']),
x='pts_last', y='pts_current', hue='player')
Output:
Note: if you don't care about individual players, you can drop hue. Also, use scatterplot instead of lineplot if that fits your needs better.
Based on your second idea:
for i in seasons[:-1]:
    sns.scatterplot(df.pts.loc[df.season==i].tolist(), df.pts.loc[df.season==(i+1)].tolist())
It seems there were two issues. One is that this Seaborn call expects plain numerical data; converting each Series to a list drops the index, so Seaborn pairs the values positionally. The other is that you need to exclude the last element of seasons, since you're plotting year n against year n+1.
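Putting the self-merge together with the question's sample frame gives a runnable sketch (here the suffixes are assigned so that pts_last really is year n and pts_current is year n+1; the seaborn call is commented out so the snippet runs without plotting):

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'season': [2017, 2018, 2019] * 3,
    'pts':    [6, 5, 9, 2, 1, 3, 10, 8, 7],
})

# Shift each row's season forward by one and merge back on (player, season):
# the row for (A, 2018) then pairs 2018 pts (left) with 2017 pts (right).
paired = df.merge(df.assign(season=df.season + 1),
                  on=['player', 'season'],
                  suffixes=['_current', '_last'])

# import seaborn as sns
# sns.scatterplot(data=paired, x='pts_last', y='pts_current', hue='player')
```

Each player appears twice in paired (2017→2018 and 2018→2019), which matches the clarification in the question.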

Missing values in Pandas Pivot table?

I have a data set that looks like the following:
student question answer number
Bob How many donuts in a dozen? A 1
Sally How many donuts in a dozen? C 1
Edward How many donuts in a dozen? A 1
....
Edward What colour is the sky? C 1
Marvin What colour is the sky? D 1
From which I wrote some code that generates a pivot table to total up the results of a test, like so:
data = pd.pivot_table(df,index=['question'],columns = ['answer'],aggfunc='count',fill_value = 0)
number
answer A B C D
question
How many donuts in a dozen? 1 4 3 2
What colour is the sky? 1 9 0 0
From there I'm creating a heatmap from the pivot table for visualization purposes. Generally this works. However, if for some reason there are no students in the selected set who have chosen one of the answers (say, no one selected "D" for any questions) then that column doesn't show up in the heatmap; the column is left off.
How can I ensure that all the required columns display in the heatmap, even if no one selected that answer?
I think an even simpler approach would be to add dropna=False to the pivot table parameters; the default is dropna=True. This worked for me in a similar situation with time series data that contained large swaths of days with NaNs.
pd.pivot_table(df, index=['question'], columns=['answer'], aggfunc='count', fill_value=0, dropna=False)
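One caveat worth checking: dropna=False can only bring back a column such as D if pandas knows D is a possible answer, which it does when the answer column has a Categorical dtype listing every option. A minimal sketch (the sample rows are invented to match the question's shape):

```python
import pandas as pd

df = pd.DataFrame({
    'student':  ['Bob', 'Sally', 'Edward'],
    'question': ['How many donuts in a dozen?'] * 3,
    'answer':   ['A', 'C', 'A'],
    'number':   [1, 1, 1],
})

# Declare the full set of possible answers up front
df['answer'] = pd.Categorical(df['answer'], categories=list('ABCD'))

data = pd.pivot_table(df, index='question', columns='answer',
                      values='number', aggfunc='count',
                      fill_value=0, dropna=False)
# columns A-D all appear, even though nobody chose B or D
```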
You can take all possible answers and reindex your result. For example, in the small sample you have provided, no student selected B. Let's say your options are A, B, C, D:
answers = [*'ABCD']
res = df.pivot_table(
    index='question',
    columns='answer',
    values='number',
    aggfunc='sum',
    fill_value=0
).reindex(answers, axis=1, fill_value=0)
answer A B C D
question
How many donuts in a dozen? 2 0 1 0
What colour is the sky? 0 0 1 1
The corresponding heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(res, annot=True)
plt.tight_layout()
plt.show()

How to create multi-line chart with duplicate index categorical data?

I have a DataFrame(df) in the below format. I want to create multi-line chart using this data.
Name Category Score Count
A Books 12025.4375 48
A Music 17893.25 4
A Movie 31796.37838 37
A Script 1560.4 5
A Art 973.125 8
B Music 1929 15
B Movie 3044.229167 48
B Script 3402.4 10
B Art 2450.125 8
C Books 14469.3 10
C Music 10488.78947 57
C Movie 1827.101695 59
C Script 7077 2
Expected Output:
I want unique Category at X-Axis.
Score at Y-Axis and Multiple lines representing multiple Name.
Count is just an additional data which is not needed for this graph.
I tried the syntax below, which does not show the output in the expected format:
lines = df.line(x='Category', y=['Name','Score'], figsize=(20,10))
I tried multiple options and answers available here but seems like nothing is working for me.
First pivot the data and then plot with DataFrame.plot; line is the default kind, so it can be omitted:
import numpy as np
import matplotlib.pyplot as plt

df1 = df.pivot(index='Category', columns='Name', values='Score')
df1.plot(figsize=(20,10))
# show the Category values on the x axis
plt.xticks(np.arange(len(df1.index)), df1.index)
plt.show()
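As a runnable check with a cut-down version of the question's data (three Names, three Categories), the pivot produces one column per Name, which is exactly what plot turns into one line each; the plotting call is commented out so the snippet runs headless:

```python
import pandas as pd

df = pd.DataFrame({
    'Name':     ['A', 'A', 'B', 'B', 'C', 'C'],
    'Category': ['Books', 'Music', 'Music', 'Movie', 'Books', 'Music'],
    'Score':    [12025.4375, 17893.25, 1929.0, 3044.229167, 14469.3, 10488.78947],
})

# One row per Category, one column per Name, Score as the cell values
df1 = df.pivot(index='Category', columns='Name', values='Score')

# df1.plot(figsize=(20, 10))  # each Name column becomes one line
```

Combinations missing from the data (e.g. B has no Books score here) come out as NaN, and plot simply leaves a gap in that line.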

plotting stacked bar graph on column values

I have a Pandas data frame that looks like this:
ID Management Administrative
1 1 2
3 2 1
4 3 3
10 1 3
essentially 1-3 is a grade of low, medium, or high. I want a stacked bar chart with Management and Administrative on the x-axis and, for each column, the stacked composition of 1, 2, and 3 in percentages.
e.g. if there were only 4 entries as above, 1 would compose 50% of the height, 2 would compose 25% and 3 would compose 25% of the height of the management bar. The y axis would go up to 100%.
Hope this makes sense. Hard to explain but if unclear willing to clarify further!
You will need to chain several operations: first melt your dataset so the department columns become a single variable, then groupby Department and Rating to count the number of IDs that fall into each bucket, then groupby Department again to calculate the percentages. Lastly you can plot your stacked bar graph:
(df4.melt()
    .rename(columns={'variable': 'Dept', 'value': 'Rating'})
    .query('Dept != "ID"')
    .groupby(['Dept', 'Rating']).size()
    .rename('Count')
    .groupby(level=0).apply(lambda x: x / sum(x))
    .unstack()
    .plot(kind='bar', stacked=True))
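The same chain can be checked on the four rows from the question. This sketch uses transform('sum') for the percentage step (it keeps the (Dept, Rating) index flat, which apply can complicate on recent pandas versions) and comments out the plotting call:

```python
import pandas as pd

df4 = pd.DataFrame({'ID': [1, 3, 4, 10],
                    'Management': [1, 2, 3, 1],
                    'Administrative': [2, 1, 3, 3]})

counts = (df4.melt()
             .rename(columns={'variable': 'Dept', 'value': 'Rating'})
             .query('Dept != "ID"')
             .groupby(['Dept', 'Rating']).size())

# Divide each count by its department total to get fractions
pct = (counts / counts.groupby(level=0).transform('sum')).unstack()

# pct.plot(kind='bar', stacked=True)  # each bar stacks to 1.0
```

For Management this gives 0.5 / 0.25 / 0.25 for ratings 1 / 2 / 3, matching the 50%/25%/25% example in the question.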

Tools to use for conditional density estimation in Python [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 3 years ago.
I have a large data set that contains 3 attributes per row: A,B,C
Column A: can take the values 1, 2, and 0.
Columns B and C: can take any values.
I'd like to perform density estimation using histograms for P(A = 2 | B,C) and plot the results using python.
I do not need the code to do it; I can try to figure that out on my own. I just need to know the procedure and which tools I should use.
To answer your overall question, we should go through the different steps and answer these questions:
How to read a csv file (or text data)?
How to filter data?
How to plot data?
At each stage, you need to use some techniques and specific tools, you might also have different choices at different stages (You can look on the internet for different alternatives).
1- How to read a csv file:
Python has a built-in csv module for going through the csv file where you store your data, but most people recommend Pandas for dealing with csv files.
After installing the Pandas package, you can read your csv file with the read_csv function.
import pandas as pd
df= pd.read_csv("file.csv")
As you didn't share the csv file, I will make a random dataset to explain the upcoming steps.
import pandas as pd
import numpy as np
t = [1,1,1,2,0,1,1,0,0,2,1,1,2,0,0,0,0,1,1,1]
df = pd.DataFrame(np.random.randn(20, 2), columns=list('AC'))
df['B'] = t  # add a column containing only 0, 1, 2 values to the dataframe
Note: Numpy is a Python package that is helpful for mathematical operations. You don't strictly need it here, but I mention it to avoid confusion.
If you print df at this point, you will get this result:
A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
3 -0.405150 -1.111787 2
4 0.502283 1.586743 0
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
7 2.731756 0.563161 0
8 2.096459 1.323511 0
9 1.386778 -1.774599 2
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
12 -0.264265 1.216617 2
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1
2- How to filter data:
There are different techniques for filtering data. The easiest one is selecting the column inside your dataframe and applying the condition. In our case, the criterion is selecting the value 2 in column B.
l = df[df['B'] == 2]
print(l)
You can also use other approaches, such as groupby or a lambda, to go through the data frame and apply different conditions to filter the data.
for key in df.groupby('B'):
    print(key)
If you run the above-mentioned scripts you'll get:
For the first one: Only data where B==2
A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2
For the second one: the results printed divided into groups.
(0, A C B
4 0.502283 1.586743 0
7 2.731756 0.563161 0
8 2.096459 1.323511 0
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0)
(1, A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1)
(2, A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2)
3- How to plot your data:
The simplest way to plot your data is with matplotlib.
The easiest way to plot the data in column B is by running:
import matplotlib.pyplot as plt
plt.hist(df.B, bins=20, color='blue')
plt.show()
You'll get this result:
If you want to plot the results combined, you should use different colors/techniques to make the plot useful.
import matplotlib.pyplot as plt
a = df.A
b = df.B
c = df.C
t = range(20)
plt.plot(t, a, 'r--', t, b, 'bs--', t, c, 'g^--')
plt.legend(['A', 'B', 'C'])
plt.show()
You'll get as a result:
Plotting data is driven by a specific need. You can explore the different ways to plot data by going through the examples on the official matplotlib.org website.
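Tying these steps back to the original question: P(A = 2 | B, C) can be estimated with two 2-D histograms on the same bin edges, one over (B, C) for all rows and one for only the rows with A == 2; the per-bin ratio is the conditional probability. A minimal sketch on synthetic data (the logistic dependence of A on B is invented just to give the estimate visible structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
B = rng.normal(size=n)
C = rng.normal(size=n)
# A == 2 is likelier for large B; otherwise A is 0 or 1
A = np.where(rng.random(n) < 1 / (1 + np.exp(-B)), 2, rng.integers(0, 2, n))

edges = np.linspace(-3, 3, 13)  # shared bin edges for both histograms

all_counts, _, _ = np.histogram2d(B, C, bins=[edges, edges])
a2_counts, _, _ = np.histogram2d(B[A == 2], C[A == 2], bins=[edges, edges])

# Per-bin ratio: fraction of rows in each (B, C) bin that have A == 2
with np.errstate(divide='ignore', invalid='ignore'):
    p_a2 = np.where(all_counts > 0, a2_counts / all_counts, np.nan)

# plt.imshow(p_a2) would show the estimate as a heatmap
```

Bins with no data are left as NaN rather than 0, so the heatmap distinguishes "no evidence" from "probability zero".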
If you're looking for other tools that do slightly more sophisticated things than nonparametric density estimation with histograms, please check this link to the python repository or directly install the package with
pip install cde
In addition to extensive documentation, the package implements:
nonparametric methods (conditional & neighborhood kernel density estimation),
semiparametric methods (least squares CDE), and
parametric neural-network-based methods (mixture density networks, kernel density estimation).
The package also lets you compute centered moments, statistical divergences (KL divergence, Hellinger, Jensen-Shannon), percentiles, expected shortfalls, and data generating processes (ARMA-jump, jump-diffusion, GMMs, etc.).
Disclaimer: I am one of the package developers.
