I would like to access a table with the bin counts and intervals from the histogram resulting from the code snippet below. Is there any way to do this?
import altair as alt
from vega_datasets import data
movies = data.movies.url
alt.Chart(movies).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    y='count()',
)
You can try the altair_transform package for this; here is the example from its README:
import altair as alt
import pandas as pd
import numpy as np
from altair_transform import transform_chart
np.random.seed(12345)
df = pd.DataFrame({
    'x': np.random.randn(20000)
})
chart = alt.Chart(df).mark_bar().encode(
    alt.X('x', bin=True),
    y='count()'
)
new_chart = transform_chart(chart)
new_chart.data
x_binned x_binned2 count
-4.0 -3.0 29
-3.0 -2.0 444
-2.0 -1.0 2703
-1.0 0.0 6815
0.0 1.0 6858
1.0 2.0 2706
2.0 3.0 423
3.0 4.0 22
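Applied to your original chart, the same pattern should work as long as the data is an in-memory DataFrame rather than a URL; loading the movies dataset with data.movies() instead of data.movies.url is an assumption here, not something from the README:
import altair as alt
from vega_datasets import data
from altair_transform import transform_chart

movies_df = data.movies()  # DataFrame instead of data.movies.url
chart = alt.Chart(movies_df).mark_bar().encode(
    alt.X("IMDB_Rating:Q", bin=True),
    y='count()',
)
# the transformed chart's data holds the bin edges and counts
transform_chart(chart).data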
I am producing a pandas bar plot of raw counts, but I would like to annotate the bars with the percentage those counts represent of the whole. I have seen a lot of people using ax.patches methods to annotate, but my values are unrelated to the get_height of the actual bars.
Here is some toy data. The plot shows the individual counts of each type; however, I want to add an annotation above each bar giving that type's percentage of all types for that person's name.
Let me know if you need any more clarification.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'ID': [1,1,1,2,2,3,3,3,4],
'name': ['bob','bob','bob','shelby','shelby','jordan','jordan','jordan','jeff'],
'type': ['type1','type2','type4','type1','type6','type5','type8','type2',None]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pivot: pd.DataFrame = df.pivot_table(index='type', columns=['name'], values='ID', aggfunc={'ID': np.sum}).fillna(0)
# compute each person's percentage of their own total across types
df_pivot['bob_pct_total']: pd.Series = (df_pivot['bob']/df_pivot['bob'].sum()).mul(100).round(1)
df_pivot['shelby_pct_total']: pd.Series = (df_pivot['shelby']/df_pivot['shelby'].sum()).mul(100).round(1)
df_pivot['jordan_pct_total']: pd.Series = (df_pivot['jordan']/df_pivot['jordan'].sum()).mul(100).round(1)
df_pivot.head(10)
name bob jordan shelby bob_pct_total shelby_pct_total jordan_pct_total
type
type1 1.0 0.0 2.0 33.3 50.0 0.0
type2 1.0 3.0 0.0 33.3 0.0 33.3
type4 1.0 0.0 0.0 33.3 0.0 0.0
type5 0.0 3.0 0.0 0.0 0.0 33.3
type6 0.0 0.0 2.0 0.0 50.0 0.0
type8 0.0 3.0 0.0 0.0 0.0 33.3
fig, ax = plt.subplots(figsize=(15,15))
df_pivot.plot(kind='bar', y=['bob','jordan','shelby'], ax=ax)
You can use the old approach of looping through the bars and using each bar's height to position whatever text you want (a sketch of that manual loop is included at the end of this answer). Since matplotlib 3.4.0 there is also a new function, bar_label, that removes much of the boilerplate:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
d = {'ID': [1, 1, 1, 2, 2, 3, 3, 3, 4],
'name': ['bob', 'bob', 'bob', 'shelby', 'shelby', 'jordan', 'jordan', 'jordan', 'jeff'],
'type': ['type1', 'type2', 'type4', 'type1', 'type6', 'type5', 'type8', 'type2', None]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pivot: pd.DataFrame = df.pivot_table(index='type', columns=['name'], values='ID', aggfunc={'ID': np.sum}).fillna(0)
# compute each person's percentage of their own total across types
df_pivot['bob_pct_total']: pd.Series = (df_pivot['bob'] / df_pivot['bob'].sum()).mul(100).round(1)
df_pivot['shelby_pct_total']: pd.Series = (df_pivot['shelby'] / df_pivot['shelby'].sum()).mul(100).round(1)
df_pivot['jordan_pct_total']: pd.Series = (df_pivot['jordan'] / df_pivot['jordan'].sum()).mul(100).round(1)
fig, ax = plt.subplots(figsize=(12, 5))
columns = ['bob', 'jordan', 'shelby']
df_pivot.plot(kind='bar', y=columns, rot=0, ax=ax)
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    ax.bar_label(bars, labels=['' if val == 0 else f'{val}' for val in df_pivot[col]])
plt.tight_layout()
plt.show()
PS: To skip labeling the first bars, you could experiment with:
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    labels = ['' if val == 0 else f'{val}' for val in df_pivot[col]]
    labels[0] = ''
    ax.bar_label(bars, labels=labels)
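For comparison, the older manual approach mentioned above loops over the bars and uses each bar's position and height to place the text. A sketch, reusing df_pivot and ax from the code above:
for bars, col in zip(ax.containers, ['bob_pct_total', 'jordan_pct_total', 'shelby_pct_total']):
    for bar, val in zip(bars, df_pivot[col]):
        if val != 0:
            # center the text horizontally on the bar, just above its top
            ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height(),
                    f'{val}', ha='center', va='bottom')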
I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class. I have used label encoding.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the output looks like this, i.e. where is the value of the 'height' column? Also, data.shape is (4, 8) instead of (4, 7), i.e. gender represented by 2 columns, class by 3, plus the 'age' and 'height' features.
Are you sure you need to use LabelEncoder + OneHotEncoder? There is a much simpler method (it does not support more advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data = pd.get_dummies(X)
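With your sample data, get_dummies leaves the numeric columns alone and only expands gender and class, which gives the (4, 7) shape you expected. A sketch of the expected result, not verified output:
print(data.columns.tolist())
# roughly: ['age', 'height', 'gender_f', 'gender_m', 'class_A', 'class_B', 'class_C']
print(data.shape)  # (4, 7)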
The problem with the current code is that after you have done the first OHE:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
the columns get shifted, and column 3 is in fact the original height column instead of the label-encoded class column. So change the second encoder to use column 4 and you will get what you want.
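A sketch of that fix, keeping your original (now deprecated) categorical_features argument; the comments describe the column layout this assumes:
# after the first encoding the columns are roughly
# [gender_0, gender_1, age, height, class_label],
# so the label-encoded class column is now at index 4
onehotencoder = OneHotEncoder(categorical_features=[4])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)  # expected (4, 7)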
I would like to compute a daily percentage change for this DataFrame (frame_):
import pandas as pd
import numpy as np
data_ = {
'A':[1,np.NaN,2,1,1,2],
'B':[1,2,3,1,np.NaN,1],
'C':[1,2,np.NaN,1,1,2],
}
dates_ = [
'06/01/2018','05/01/2018','04/01/2018','03/01/2018','02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A','B','C'])
The issue is that I get a DataFrame with this method:
returns_ = frame_.pct_change(periods=1, fill_method='pad')
dates,A,B,C
06/01/2018,,,
05/01/2018,,1.0,1.0
04/01/2018,1.0,0.5,
03/01/2018,-0.5,-0.6666666666666667,-0.5
02/01/2018,0.0,,0.0
01/01/2018,1.0,0.0,1.0
This is not what I am looking for, and the dropna() method also doesn't give me the result I seek. I would like a computed value for each day that has a value, and NaN for the days where the value is missing. For example, for column A the percentage change I would like to see is:
dates,A
06/01/2018,1
05/01/2018,
04/01/2018,1.0
03/01/2018,-0.5
02/01/2018,0.0
01/01/2018,1.0
Many thanks in advance
This is one way, a bit brute-force.
import pandas as pd
import numpy as np
data_ = {
'A':[1,np.NaN,2,1,1,2],
'B':[1,2,3,1,np.NaN,1],
'C':[1,2,np.NaN,1,1,2],
}
dates_ = [
'06/01/2018','05/01/2018','04/01/2018','03/01/2018','02/01/2018', '01/01/2018'
]
frame_ = pd.DataFrame(data_, index=dates_, columns=['A','B','C'])
frame_ = pd.concat([frame_, pd.DataFrame(columns=['dA', 'dB', 'dC'])])
for col in ['A', 'B', 'C']:
    frame_['d'+col] = frame_[col].pct_change()
    frame_.loc[pd.notnull(frame_[col]) & pd.isnull(frame_['d'+col]), 'd'+col] = frame_[col]
# A B C dA dB dC
# 06/01/2018 1.0 1.0 1.0 1.0 1.000000 1.0
# 05/01/2018 NaN 2.0 2.0 NaN 1.000000 1.0
# 04/01/2018 2.0 3.0 NaN 1.0 0.500000 NaN
# 03/01/2018 1.0 1.0 1.0 -0.5 -0.666667 -0.5
# 02/01/2018 1.0 NaN 1.0 0.0 NaN 0.0
# 01/01/2018 2.0 1.0 2.0 1.0 0.000000 1.0
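The same logic can also be written without the explicit loop; a sketch that should match the loop's result, since it applies exactly the same rule to the original columns:
orig = frame_[['A', 'B', 'C']]
returns_ = orig.pct_change()
# where the original value exists but pct_change produced NaN,
# fall back to the original value itself
mask = orig.notna() & returns_.isna()
returns_ = returns_.mask(mask, orig)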
I have the following dataframe data:
import pandas as pd
from io import StringIO
data = pd.read_table(StringIO("""time_diff avg_trips_per_day
631 1.0
231 1.0
431 1.0
7031 1.0
17231 1.0
20000 20.0
21000 15.0
22000 10.0"""), delim_whitespace=True)
I create a bar chart as follows:
import seaborn as sns
data['timegroup'] = pd.qcut(data['time_diff'], 3)
sns.barplot(x='timegroup', y='avg_trips_per_day', data=data)
Currently it takes the values of avg_trips_per_day in each bin (timegroup) and calculates the mean avg_trips_per_day.
However, I want to sum up the values of avg_trips_per_day in each timegroup bin instead of taking the mean. How can I do this?
Use the estimator parameter of barplot:
sns.barplot(x='timegroup', y='avg_trips_per_day', data=data, estimator=sum)
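If you prefer to aggregate in pandas first, this should give the same bars (a sketch, reusing the data frame from above):
summed = data.groupby('timegroup', as_index=False)['avg_trips_per_day'].sum()
sns.barplot(x='timegroup', y='avg_trips_per_day', data=summed)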
I am trying to create a stacked bar graph to show how the launch vehicles of satellites have changed over time. I'd like the x axis to be the year of launch and the y axis to be the number of satellites launched on each vehicle, where each section of a bar is a different color representing the launch vehicle. I am struggling to come up with a way to do this because my Launch Vehicle column is non-numerical. I looked into groupby as well as value_counts but can't seem to get either to do what I am looking for.
You have to reorganize your data to use DataFrame.plot in the desired way:
import pandas as pd
import matplotlib.pylab as plt
# test data
df = pd.DataFrame({'Launch Vehicle':["Soyuz 2.1a",'Ariane 5 ECA','Falcon 9','Long March','Falcon 9', 'Atlas 3','Atlas 3'],
'Year of Launch': [2016,2014,2016,1997,2015,2004,2004]})
# group by year and launch vehicle, then unstack to get a pivot-style table
# fillna(0) fills in zero for vehicles with no launches in a given year
df2 = df.groupby(['Year of Launch','Launch Vehicle'])['Year of Launch'].count().unstack('Launch Vehicle').fillna(0)
print(df2)
# plot the data
df2.plot(kind='bar', stacked=True, rot=1)
plt.show()
Output of df2:
Launch Vehicle Ariane 5 ECA Atlas 3 Falcon 9 Long March Soyuz 2.1a
Year of Launch
1997 0.0 0.0 0.0 1.0 0.0
2004 0.0 2.0 0.0 0.0 0.0
2014 1.0 0.0 0.0 0.0 0.0
2015 0.0 0.0 1.0 0.0 0.0
2016 0.0 0.0 1.0 0.0 1.0
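The same table can also be built with pd.crosstab, which counts the year/vehicle combinations directly (a sketch, reusing df and plt from above):
df2 = pd.crosstab(df['Year of Launch'], df['Launch Vehicle'])
df2.plot(kind='bar', stacked=True, rot=1)
plt.show()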