I have a data set that looks like the following:
student question answer number
Bob How many donuts in a dozen? A 1
Sally How many donuts in a dozen? C 1
Edward How many donuts in a dozen? A 1
....
Edward What colour is the sky? C 1
Marvin What colour is the sky? D 1
From this I wrote some code that generates a pivot table totalling up the results of a test, like so:
data = pd.pivot_table(df,index=['question'],columns = ['answer'],aggfunc='count',fill_value = 0)
number
answer A B C D
question
How many donuts in a dozen? 1 4 3 2
What colour is the sky? 1 9 0 0
From there I'm creating a heatmap from the pivot table for visualization. Generally this works. However, if no students in the selected set chose one of the answers (say, no one selected "D" for any question), then that column is left out of the heatmap entirely.
How can I ensure that all the required columns display in the heatmap, even if no one selected that answer?
I think an even simpler approach would be to add dropna=False to the pivot table parameters (the default is True). This worked for me in a similar situation with time-series data that contained large swaths of days with NaNs.
data = pd.pivot_table(df, index=['question'], columns=['answer'], aggfunc='count', fill_value=0, dropna=False)
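One caveat worth noting: dropna=False on its own can only keep columns that pandas already knows about. For an answer that never occurs in the data at all, it seems you also need to make the answer column a Categorical listing the full set of options. A minimal sketch, assuming the options are A-D:

```python
import pandas as pd

df = pd.DataFrame({
    'student': ['Bob', 'Sally', 'Edward'],
    'question': ['How many donuts in a dozen?'] * 3,
    'answer': ['A', 'C', 'A'],
    'number': [1, 1, 1],
})

# Cast 'answer' to a Categorical so pandas knows about options nobody picked
df['answer'] = pd.Categorical(df['answer'], categories=list('ABCD'))

res = pd.pivot_table(df, index='question', columns='answer', values='number',
                     aggfunc='count', fill_value=0, dropna=False, observed=False)
print(res)
# columns A, B, C and D all appear, even though B and D were never chosen
```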
You can take all possible answers and reindex your result. For example, in the small sample you have provided, no student selected B. Let's say your options are A, B, C, D:
answers = [*'ABCD']
res = df.pivot_table(
index='question',
columns='answer',
values='number',
aggfunc='sum',
fill_value=0
).reindex(answers, axis=1, fill_value=0)
answer A B C D
question
How many donuts in a dozen? 2 0 1 0
What colour is the sky? 0 0 1 1
The corresponding heatmap:
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(res, annot=True)
plt.tight_layout()
plt.show()
I am new to Python and am struggling with Pandas. More specifically, I have a column (Sensory scores) in a dataframe that consists of multiple words, like this:
*Treatment* *Sensory scores*
A soft, short
B soft, tender
C short, tender
Now I want to add extra columns 'soft', 'short' and 'tender' to the dataframe, whereby the individual scores are extracted and quantified like this:
*Treatment* *Sensory scores* *soft* *short* *tender*
A soft, short 1 1 0
B soft, tender 1 0 1
C short, tender 0 1 1
What is the best way to program this in Pandas? Any help, suggestions are appreciated. Many thanks in advance.
Coen
You first need to split the values; then you can use pivot_table to sum a dummy column (count):
df = df.set_index("*Treatment*")
df = pd.DataFrame(df["*Sensory scores*"].str.split(', ').explode())
df["count"] = 1
df = df.pivot_table(index=df.index, columns="*Sensory scores*", values="count", aggfunc="sum", fill_value=0)
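Put together as a runnable sketch (column names without the asterisks, for readability), joining the indicator columns back onto the original frame as the question asks:

```python
import pandas as pd

df = pd.DataFrame({'Treatment': ['A', 'B', 'C'],
                   'Sensory scores': ['soft, short', 'soft, tender', 'short, tender']})

# One row per individual score word
exploded = df.assign(score=df['Sensory scores'].str.split(', ')).explode('score')
exploded['count'] = 1

# Pivot the long table into one indicator column per score
dummies = exploded.pivot_table(index='Treatment', columns='score',
                               values='count', aggfunc='sum', fill_value=0)

# Join the indicators back onto the original frame
out = df.merge(dummies.reset_index(), on='Treatment')
print(out)
```

The indicator step can also be done in one line with df['Sensory scores'].str.get_dummies(sep=', ').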
Selecting Multiple Explicit Rows and a Column to Then Calculate the Mean
Hi everyone, thank you for taking a look at this question. I'm working with a Pokémon data set and looking to select explicit rows and a column to then calculate the mean for that value.
The columns I'm working with are type_1, generation and total_points.
The Row values are Grass and 1
Grass corresponds to the type and 1 to the generation.
grass_total_points = pokedex.loc[pokedex.type_1 == 'Grass', ['total_points']].mean()
This code above works and returns the total mean for all grass types across all 8 generations but I would like to retrieve them on a generation by generation basis.
gen_1 = pokedex.loc[pokedex['generation'] == '1' & pokedex['type_1'] == 'Grass', ['total_points']].mean()
I attempted the code above with no luck. I searched around and cannot find any answers to this.
Jeremy, try the groupby method:
import pandas as pd
pokedex = pd.DataFrame({'Type':['Grass','Grass','Grass','Sand','Sand'],
'Generation':[1,2,1,1,1],"TotalPoints":[50,10,20,30,40]})
pokedex.groupby(['Type','Generation'])['TotalPoints'].mean()
Should return:
Type Generation
Grass 1 35.0
2 10.0
Sand 1 35.0
Name: TotalPoints, dtype: float64
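For completeness, the original .loc attempt fails because & binds more tightly than ==, so each comparison needs its own parentheses (and generation is presumably an integer, not the string '1'). A sketch against a toy frame that reuses the question's column names:

```python
import pandas as pd

pokedex = pd.DataFrame({'type_1': ['Grass', 'Grass', 'Grass', 'Sand', 'Sand'],
                        'generation': [1, 2, 1, 1, 1],
                        'total_points': [50, 10, 20, 30, 40]})

# Parenthesize each comparison: & has higher precedence than ==
gen_1 = pokedex.loc[(pokedex['generation'] == 1) &
                    (pokedex['type_1'] == 'Grass'), 'total_points'].mean()
print(gen_1)  # 35.0
```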
I have a DataFrame(df) in the below format. I want to create multi-line chart using this data.
Name Category Score Count
A Books 12025.4375 48
A Music 17893.25 4
A Movie 31796.37838 37
A Script 1560.4 5
A Art 973.125 8
B Music 1929 15
B Movie 3044.229167 48
B Script 3402.4 10
B Art 2450.125 8
C Books 14469.3 10
C Music 10488.78947 57
C Movie 1827.101695 59
C Script 7077 2
Expected Output:
I want unique Category at X-Axis.
Score at Y-Axis and Multiple lines representing multiple Name.
Count is just an additional data which is not needed for this graph.
I tried using the below syntax, which does not show the output in the expected format.
lines = df.line(x= 'Category',\
y=['Name','Score'],figsize=(20,10))
I tried multiple options and answers available here but seems like nothing is working for me.
First pivot the data, then plot with DataFrame.plot; 'line' is the default kind, so it can be omitted:
import matplotlib.pyplot as plt
import numpy as np

df1 = df.pivot(index='Category', columns='Name', values='Score')
df1.plot(figsize=(20,10))
# show category labels on the x axis
plt.xticks(np.arange(len(df1.index)), df1.index)
plt.show()
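Note that in recent pandas versions DataFrame.pivot requires keyword arguments. A minimal end-to-end sketch with a few rows of the sample data (scores rounded, and a headless backend assumed so it runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'Category': ['Books', 'Music', 'Music', 'Movie', 'Books', 'Music'],
                   'Score': [12025.44, 17893.25, 1929.0, 3044.23, 14469.3, 10488.79]})

# One column per Name, Category on the index -> one line per Name when plotted
df1 = df.pivot(index='Category', columns='Name', values='Score')
ax = df1.plot(figsize=(10, 5), marker='o')
ax.set_ylabel('Score')
plt.savefig('scores.png')
```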
I have a dataset
a b c d
10-Apr-86 Jimmy 1 this is
11-Apr-86 Minnie 2 the way
12-Apr-86 Jimmy 3 the world
13-Apr-86 Minnie 4 ends
14-Apr-86 Jimmy 5 this is the
15-Apr-86 Eliot 6 way
16-Apr-86 Jimmy 7 the world ends
17-Apr-86 Eliot 8 not with a bang
18-Apr-86 Minnie 9 but a whimper
I want to make a chart in matplotlib that looks like this
I've figured out how to get just the dots (no annotations) using the following code:
df = (pd.read_csv('python.csv'))
df_wanted = pd.pivot_table(
df,
index='a',
columns='b',
values='c')
df_wanted.index = pd.to_datetime(df_wanted.index)
plt.scatter(df_wanted.index, df_wanted['Jimmy'])
plt.scatter(df_wanted.index,df_wanted['Minnie'])
plt.scatter(df_wanted.index,df_wanted['Eliot'])
I think that to annotate, I need a list of values (as demonstrated here) in the final column of my pivot table.
My problem is: how do I get that final column 'd' of the original dataset to become the final column of my pivot table?
I tried dat1 = pd.concat([df_wanted, df['d']], axis = 1) - but this created a new set of rows underneath the rows of my dataframe. I realized the axis wasn't the same, so I tried to make a new pivot table with the d column as values - but got the error message No numeric types to aggregate.
I tried df_wanted2.append(df['d']) - but this made a new column for every element in column d.
Any advice? Ultimately, I want to make it so the data labels appear when one rolls over the point with the mouse.
In this specific case, it doesn't seem you need to set column d as the final column of your pivot table.
plt.scatter(df_wanted.index, df_wanted['Jimmy'])
plt.scatter(df_wanted.index,df_wanted['Minnie'])
plt.scatter(df_wanted.index,df_wanted['Eliot'])
plt.legend(loc=0)
for k, v in df.assign(a=pd.to_datetime(df['a'])).set_index('a').iterrows():
    plt.text(k, v['c'], v['d'])  # or: plt.annotate(text=v['d'], xy=(k, v['c']))
I have a large data set that contains 3 attributes per row: A,B,C
Column A: can take the values 1, 2, and 0.
Column B and C: can take any values.
I'd like to perform density estimation using histograms for P(A = 2 | B,C) and plot the results using python.
I do not need the code to do it; I can try to figure that out on my own. I just need to know the procedure and which tools I should use.
To answer your overall question, we should go through several steps and answer several questions:
How to read a csv file (or text data)?
How to filter data?
How to plot data?
At each stage you need specific techniques and tools, and you may have several choices at different stages (you can look on the internet for alternatives).
1- How to read a csv file:
Python has a built-in csv module for reading such files, but most people recommend Pandas for working with csv data.
After installing the Pandas package, you can read your csv file with the read_csv function.
import pandas as pd
df = pd.read_csv("file.csv")
As you didn't share the csv file, I will make a random dataset to explain the upcoming steps.
import pandas as pd
import numpy as np
t = [1, 1, 1, 2, 0, 1, 1, 0, 0, 2, 1, 1, 2, 0, 0, 0, 0, 1, 1, 1]
df = pd.DataFrame(np.random.randn(20, 2), columns=list('AC'))
df['B'] = t  # a column with only 0/1/2 values, inserted into the dataframe
Note: NumPy is a Python package that is helpful for mathematical operations. You don't strictly need it here, but I mention it to clear up any confusion.
If you print df at this point, you will get:
A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
3 -0.405150 -1.111787 2
4 0.502283 1.586743 0
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
7 2.731756 0.563161 0
8 2.096459 1.323511 0
9 1.386778 -1.774599 2
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
12 -0.264265 1.216617 2
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1
2- How to filter data:
There are different techniques to filter data. The easiest is boolean indexing: select the column of your dataframe and apply a condition. In our case, the criterion is rows where column B equals 2.
l = df[df['B'] == 2]
print(l)
You can also use other approaches, such as groupby or lambda functions, to go through the data frame and apply different conditions to filter the data.
for key in df.groupby('B'):
    print(key)
If you run the above-mentioned scripts you'll get:
For the first one: Only data where B==2
A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2
For the second one: Printing the results divided in groups.
(0, A C B
4 0.502283 1.586743 0
7 2.731756 0.563161 0
8 2.096459 1.323511 0
13 1.731371 -0.906727 0
14 0.969974 1.305460 0
15 -0.795679 -0.707238 0
16 0.274473 1.842542 0)
(1, A C B
0 -0.090162 0.035458 1
1 2.068328 -0.357626 1
2 -0.476045 -1.217848 1
5 1.822558 -0.398833 1
6 0.367663 0.305023 1
10 -0.512147 -0.677339 1
11 -0.091165 0.587496 1
17 0.771794 -1.726273 1
18 0.126508 -0.206365 1
19 0.622025 -0.322115 1)
(2, A C B
3 -0.405150 -1.111787 2
9 1.386778 -1.774599 2
12 -0.264265 1.216617 2)
3- How to plot your data:
The simplest way to plot your data is with matplotlib.
The easiest way to plot the data in column B is to run:
import matplotlib.pyplot as plt

plt.hist(df.B, bins=20, color='blue')
plt.show()
You'll get this result:
If you want to plot the results combined, you should use different colors/styles to make the chart useful.
import matplotlib.pyplot as plt

t = range(20)
plt.plot(t, df.A, 'r--', label='A')
plt.plot(t, df.B, 'bs--', label='B')
plt.plot(t, df.C, 'g^--', label='C')
plt.legend()
plt.show()
You'll get as a result:
Plotting data is driven by a specific need. You can explore the different ways to plot data by going through the examples on the official matplotlib.org website.
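The steps above read, filter, and plot the data, but the conditional estimate itself can also be sketched directly: bin (B, C) with 2-D histograms, then divide the per-bin count of rows with A == 2 by the per-bin total. A minimal sketch (the synthetic data and bin count are assumptions, and the labels sit in column A as in the original question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'A': rng.integers(0, 3, 1000),   # labels 0/1/2
                   'B': rng.normal(size=1000),
                   'C': rng.normal(size=1000)})

bins = 5
# Shared bin edges so both histograms line up
b_edges = np.linspace(df['B'].min(), df['B'].max(), bins + 1)
c_edges = np.linspace(df['C'].min(), df['C'].max(), bins + 1)

total, _, _ = np.histogram2d(df['B'], df['C'], bins=[b_edges, c_edges])
hit, _, _ = np.histogram2d(df.loc[df['A'] == 2, 'B'], df.loc[df['A'] == 2, 'C'],
                           bins=[b_edges, c_edges])

# P(A = 2 | B in bin i, C in bin j); empty bins come out as NaN
with np.errstate(invalid='ignore'):
    p = hit / total
```

The resulting p grid can then be shown with plt.imshow or a heatmap.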
If you're looking for other tools that do slightly more sophisticated things than nonparametric density estimation with histograms, please check this link to the Python repository, or directly install the package with
pip install cde
In addition to extensive documentation, the package implements:
Nonparametric (conditional & neighborhood kernel density estimation)
semiparametric (least squares cde) and
parametric neural network-based methods (mixture density networks, kernel density estimation)
Also, the package allows you to compute centered moments, statistical divergences (KL divergence, Hellinger, Jensen-Shannon), percentiles, expected shortfalls and data generating processes (ARMA-jump, jump-diffusion, GMMs, etc.).
Disclaimer: I am one of the package developers.