I have a dataset that maps continuous values to discrete categories. I want to display a histogram with the continuous values as x and categories as y, where bars are stacked and normalized. Example:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
},
columns=['score', 'category'])
print(df.head(10))
Output:
score category
0 0.649371 B
1 0.042309 B
2 0.689487 A
3 0.433064 B
4 0.978859 A
5 0.789140 C
6 0.215758 D
7 0.922389 B
8 0.105364 D
9 0.010274 C
If I try to plot this as a histogram using df.hist(by='category'), I get 4 separate graphs, one per category, rather than a single stacked plot.
I managed to get the graph I wanted but I had to do a lot of manipulation.
# One column per category, 1 if maps to category, 0 otherwise
df2 = pd.DataFrame({
'score' : df.score,
'A' : (df.category == 'A').astype(float),
'B' : (df.category == 'B').astype(float),
'C' : (df.category == 'C').astype(float),
'D' : (df.category == 'D').astype(float)
},
columns=['score', 'A', 'B', 'C', 'D'])
# select "bins" of .1 width, and sum for each category
df3 = pd.DataFrame([df2[(df2.score >= (n/10.0)) & (df2.score < ((n+1)/10.0))].iloc[:, 1:].sum() for n in range(10)])
# Sum over series for weights
df4 = df3.sum(1)
bars = pd.DataFrame(df3.values / np.tile(df4.values, [4, 1]).transpose(), columns=list('ABCD'))
bars.plot.bar(stacked=True)
I expect there is a more straightforward way to do this that is easier to read and understand, with fewer intermediate steps. Any solutions?
I don't know if this is really that much more compact or readable than what you already have, but here is a suggestion (a late one, as it were :)).
import numpy as np
import pandas as pd
df = pd.DataFrame({
'score' : np.random.rand(1000),
'category' : np.random.choice(list('ABCD'), 1000)
}, columns=['score', 'category'])
# Set the range of the score as a category using pd.cut
df.set_index(pd.cut(df['score'], np.linspace(0, 1, 11)), inplace=True)
# Count all entries for all scores and all categories
a = df.groupby([df.index, 'category']).size()
# Normalize
b = df.groupby(df.index)['category'].count()
df_a = a.div(b, axis=0, level=0)
# Plot
df_a.unstack().plot.bar(stacked=True)
Consider assigning bins with cut, calculating group percentages with a couple of groupby().transform calls, and then aggregating and reshaping with pivot_table:
# CREATE BIN INDICATORS
df['plot_bins'] = pd.cut(df['score'], bins=np.arange(0,1.1,0.1),
labels=np.arange(0,1,0.1)).round(1)
# CALCULATE PCT OF CATEGORY OUT OF BINs
df['pct'] = (df.groupby(['plot_bins', 'category'])['score'].transform('count')
.div(df.groupby(['plot_bins'])['score'].transform('count')))
# PIVOT TO AGGREGATE + RESHAPE
agg_df = (df.pivot_table(index='plot_bins', columns='category', values='pct', aggfunc='max')
.reset_index(drop=True))
# PLOT
agg_df.plot(kind='bar', stacked=True, rot=0)
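As an aside, the binning and normalization can also be collapsed into a single pd.crosstab call. This is just a sketch (assuming a pandas version where crosstab supports the normalize argument), not taken from either answer above:
import numpy as np
import pandas as pd
df = pd.DataFrame({'score': np.random.rand(1000),
                   'category': np.random.choice(list('ABCD'), 1000)})
# Bin scores into ten equal-width intervals, cross-tabulate against category,
# and normalize each row so the stacked bars sum to 1
bins = pd.cut(df['score'], np.linspace(0, 1, 11))
pd.crosstab(bins, df['category'], normalize='index').plot.bar(stacked=True)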
I want to plot a bar graph from the dataframe below.
df2 = pd.DataFrame({'URL': ['A','B','C','D','E','F'],
'X': [5,0,7,1,0,6],
'Y': [21,0,4,7,9,0],
'Z':[11,0,8,4,0,0]})
URL X Y Z
0 A 5 21 11
1 B 0 0 0
2 C 7 4 8
3 D 1 7 4
4 E 0 9 0
5 F 6 0 0
I want to plot a bar graph with X, Y, and Z on the x-axis and counts on the y-axis, with two bars for each column: one bar showing the total sum of all numbers in the respective column, and the other showing the number of non-zero values in that column. Can anyone help with this? Thank you.
You can use:
(df2
.reset_index()
.melt(id_vars=['index', 'URL'])
.assign(category=lambda d: np.where(d['value'].eq(0), 'Z', 'NZ'))
.pivot(index=['index', 'URL', 'variable'], columns='category', values='value')
.groupby('variable')
.agg(**{'sum(non-zero)': ('NZ', 'sum'), 'count(non-zero)': ('NZ', 'count')})
.plot.bar()
)
df2.melt("URL").\
groupby("variable").\
agg(sums=("value", "sum"),
nz=("value", lambda x: sum(x != 0))).\
plot(kind="bar")
Try:
import pandas as pd
import matplotlib.pyplot as plt
df2 = pd.DataFrame({'URL': ['A','B','C','D','E','F'],
'X': [5,0,7,1,0,6],
'Y': [21,0,4,7,9,0],
'Z':[11,0,8,4,0,0]})
df2_ = df2[["X", "Y", "Z"]]
sums = df2_.sum().to_frame(name="sums")
nonzero_count = (~(df2_==0)).sum().to_frame(name="count_non_zero")
pd.concat([sums,nonzero_count], axis=1).plot.bar()
plt.show()
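For reference, all three approaches plot the same pairs of bars. As a quick sanity check of the numbers (a small sketch reusing the df2 defined in the question), both statistics can be computed with a single agg call over the numeric columns:
# Reusing df2 from above: column sums and non-zero counts in one agg call
summary = df2[['X', 'Y', 'Z']].agg(['sum', lambda s: (s != 0).sum()])
summary.index = ['sum', 'non_zero']
print(summary.T)
#    sum  non_zero
# X   19         4
# Y   41         4
# Z   23         3
summary.T.plot.bar()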
I didn't find a complete answer to what I want to do:
I have a dataframe of users and their answers to a survey. I want to group by user, take the sum of their good answers divided by their total number of answers, display it as a percentage, and plot the result.
The answer column contains 1, 0, or -1. I want to filter out the -1 values.
Here is what I have done so far:
df_sample.groupby('user').filter(lambda x : x['answer'].mean() >-1)
or:
a = df_sample.loc[df_sample['answer']!=-1,['user','answer']]
b = a.groupby(['user','answer']).agg({'answer' : 'sum'})
As you can see, it's incomplete. Thank you for any suggestions you may have.
Let's try with some sample data:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
df.head():
user answer
0 D 1
1 C 0
2 D -1
3 B 1
4 C 1
Option 1
Filter out the -1 values and use named aggregation to get the "good answers" and "total answers":
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df:
good_answer total_answer
user
A 9 15
B 11 20
C 15 19
D 7 14
Use division and multiplication to get percentage:
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
plot_df:
good_answer total_answer pct
user
A 9 15 60.000000
B 11 20 55.000000
C 15 19 78.947368
D 7 14 50.000000
Then this can be plotted with DataFrame.plot:
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
ax.bar_label(container, fmt='%.2f%%')
plt.show()
Option 2
If just the percentage is needed, groupby mean can be used to get to the resulting plot directly after filtering out the -1s:
plot_df = df[df['answer'].ne(-1)].groupby('user')['answer'].mean().mul(100)
ax = plot_df.plot(
kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
ax.bar_label(container, fmt='%.2f%%')
plt.show()
plot_df:
answer
user
A 60.000000
B 55.000000
C 78.947368
D 50.000000
Both options produce the same bar chart.
All together:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
np.random.seed(5)
n = 100
df = pd.DataFrame({'user': np.random.choice(list("ABCD"), size=n),
'answer': np.random.choice([1, 0, -1], size=n)})
plot_df = df[df['answer'].ne(-1)].groupby('user').aggregate(
good_answer=('answer', 'sum'),
total_answer=('answer', 'size')
)
plot_df['pct'] = (plot_df['good_answer'] / plot_df['total_answer'] * 100)
ax = plot_df.plot(
y='pct', kind='bar', rot=0,
title='Percentage of Good Answers',
ylim=[0, 100],
label='Percent Good'
)
# Add Labels on Top of Bars
for container in ax.containers:
ax.bar_label(container, fmt='%.2f%%')
plt.show()
Here is a sample solution, assuming you want each user's percentage computed relative to the whole filtered dataframe rather than to their own number of answers.
import pandas as pd
import numpy as np
df_sample = pd.DataFrame(np.random.randint(-1,2,size=(10, 1)), columns=['answer'])
df_sample['user'] = [i for i in 'a b c d e f a b c d'.split(' ')]
df_filtered = df_sample[df_sample.answer>-1]
print(df_filtered.groupby('user').agg({'answer' : lambda x: x.sum()/len(df_filtered)*100}))
I have two temporal sequences, let's say tweets per hour of the day, for two different categories. I would like to plot them as a joyplot with the time of day on the x-axis and the number of tweets as the height.
My code is:
from joypy import joyplot
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(index = range(24 * 2))
df['value'] = list([int(10 * aa) for aa in np.random.rand(24 * 2)])
df.loc[23 : 30, 'value'] = 0
df['hour'] = [aa for aa in range(24)] * 2
df['cat'] = ['A'] * 24 + ['B'] * 24
joyplot(df, by = 'cat', column = 'hour')
so df.head(3) will return:
value hour cat
0 5 0 A
1 7 1 A
2 6 2 A
I am expecting to see the variation of value over hour, but I get two flat plots.
Plotting the two lines directly (to give an idea of the expected shape):
import matplotlib.pyplot as plt
dfa = df[df.cat == 'A']
dfb = df[df.cat == 'B']
plt.figure()
plt.plot(dfa.hour, dfa.value)
plt.plot(dfb.hour, dfb.value)
I obtain the two varying lines I expect.
How should I fix this?
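A likely cause, and one possible workaround (a sketch, not a verified joypy-specific recipe): by default joyplot draws a density estimate of the values in column, so with column='hour' every hour occurs exactly once per category and the densities come out flat. Repeating each row value times turns the counts into a sample whose density follows the intended shape, while keeping the joyplot call identical to the one in the question:
from joypy import joyplot
import pandas as pd
import numpy as np
np.random.seed(0)
# Rebuild df exactly as in the question
df = pd.DataFrame(index=range(24 * 2))
df['value'] = [int(10 * aa) for aa in np.random.rand(24 * 2)]
df.loc[23:30, 'value'] = 0
df['hour'] = list(range(24)) * 2
df['cat'] = ['A'] * 24 + ['B'] * 24
# Repeat each hour 'value' times so the density of 'hour' reflects the counts
expanded = df.loc[df.index.repeat(df['value'])]
joyplot(expanded, by='cat', column='hour')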
Using a numpy random number generator, generate arrays of height and weight for the 88,000 people living in Utah.
The average height is 1.75 metres and the average weight is 70 kg. Assume a standard deviation of 3.
Combine these two arrays using the column_stack method and convert the result into a pandas DataFrame, with the first column named 'height' and the second column named 'weight'.
I've gotten the randomly generated data. However, I can't seem to convert the array to a DataFrame.
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
print(Utah)
df = pd.DataFrame(
[[np_height],
[np_weight]],
index = [0, 1],
columns = ['height', 'weight'])
print(df)
You want 2 columns, yet you passed the data [[np_height], [np_weight]] as two rows of a single column, which doesn't match the two column names. You can pass the data as a dict instead:
df = pd.DataFrame({'height':np_height,
'weight':np_weight},
columns = ['height', 'weight'])
print(df)
The data in Utah is already in a suitable shape. Why not use that?
import numpy as np
import pandas as pd
height = np.round(np.random.normal(1.75, 3, 88000), 2)
weight = np.round(np.random.normal(70, 3, 88000), 2)
np_height = np.array(height)
np_weight = np.array(weight)
Utah = np.round(np.column_stack((np_height, np_weight)), 2)
df = pd.DataFrame(
data=Utah,
columns=['height', 'weight']
)
print(df.head())
height weight
0 3.57 65.32
1 -0.15 66.22
2 5.65 73.11
3 2.00 69.59
4 2.67 64.95
I want to remove outliers based on 99th-percentile values, group-wise.
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
In the output I want to remove 11.2 from group A and 100 from group B, so the final dataset will contain only 5 observations.
wantdf = pd.DataFrame({'Group': ['A','A','B','B','B'], 'count': [1.1,1.1,3.3,3.40,3.3]})
I have tried this, but I'm not getting the desired results:
df[df.groupby("Group")['count'].transform(lambda x : (x<x.quantile(0.99))&(x>(x.quantile(0.01)))).eq(1)]
Here is my solution:
def is_outlier(s):
lower_limit = s.mean() - (s.std() * 3)
upper_limit = s.mean() + (s.std() * 3)
return ~s.between(lower_limit, upper_limit)
df = df[~df.groupby('Group')['count'].apply(is_outlier)]
You can write your own is_outlier function with whatever rule fits your data.
I don't think you want to use quantile, as you'll exclude your lower values:
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
print(pd.DataFrame(df.groupby('Group').quantile(.01)['count']))
output:
count
Group
A 1.1
B 3.3
Those aren't outliers, right? So you wouldn't want to exclude them.
You could try setting left and right limits by using standard deviations from the median maybe? This is a bit verbose, but it gives you the right answer:
left = pd.DataFrame(df.groupby('Group').median() - pd.DataFrame(df.groupby('Group').std()))
right = pd.DataFrame(df.groupby('Group').median() + pd.DataFrame(df.groupby('Group').std()))
left.columns = ['left']
right.columns = ['right']
df = df.merge(left, left_on='Group', right_index=True)
df = df.merge(right, left_on='Group', right_index=True)
df = df[(df['count'] > df['left']) & (df['count'] < df['right'])]
df = df.drop(['left', 'right'], axis=1)
print(df)
output:
Group count
0 A 1.1
2 A 1.1
3 B 3.3
4 B 3.4
5 B 3.3
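For completeness, the asker's original quantile idea also yields exactly the five wanted rows if the lower bound is dropped. This is just a sketch (and with only three or four values per group, whether a 99th-percentile cutoff is meaningful is another question):
import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','B','B','B','B'], 'count': [1.1,11.2,1.1,3.3,3.40,3.3,100.0]})
# Keep only values at or below each group's 99th percentile
wantdf = df[df.groupby('Group')['count'].transform(lambda x: x <= x.quantile(0.99))]
print(wantdf)
#   Group  count
# 0     A    1.1
# 2     A    1.1
# 3     B    3.3
# 4     B    3.4
# 5     B    3.3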