Stacked Bar Plot-Starting with NonNumerical Items - python

I am trying to create a stacked bar graph to show how the launch vehicles of satellites have changed over time. I'd like the x axis to be the year of the launch and the y axis to be the number of satellites launched on each vehicle, where each section of a bar has a different color that represents the launch vehicle. I am struggling to come up with a way to do this because my Launch Vehicle column is non-numerical. I looked into groupby as well as value_counts but can't seem to get them to do what I am looking for.

You have to reorganize your data so that DataFrame.plot can draw it the way you want: one row per year, one column per launch vehicle, and the launch counts as values:
import pandas as pd
import matplotlib.pyplot as plt
# test data
df = pd.DataFrame({'Launch Vehicle': ["Soyuz 2.1a", 'Ariane 5 ECA', 'Falcon 9', 'Long March', 'Falcon 9', 'Atlas 3', 'Atlas 3'],
                   'Year of Launch': [2016, 2014, 2016, 1997, 2015, 2004, 2004]})
# group by year and rocket type, count the launches, then unstack to get a pivot table
# fillna(0) puts zero where a vehicle had no launch in a given year
df2 = df.groupby(['Year of Launch', 'Launch Vehicle'])['Year of Launch'].count().unstack('Launch Vehicle').fillna(0)
print(df2)
# plot the data; rot=0 keeps the year labels horizontal
df2.plot(kind='bar', stacked=True, rot=0)
plt.show()
Output of df2:
Launch Vehicle  Ariane 5 ECA  Atlas 3  Falcon 9  Long March  Soyuz 2.1a
Year of Launch
1997                     0.0      0.0       0.0         1.0         0.0
2004                     0.0      2.0       0.0         0.0         0.0
2014                     1.0      0.0       0.0         0.0         0.0
2015                     0.0      0.0       1.0         0.0         0.0
2016                     0.0      0.0       1.0         0.0         1.0
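As a possible alternative to the groupby/unstack chain, pd.crosstab builds the same year-by-vehicle count table in one call and keeps the counts as integers. This is only a sketch using the same test data; the name counts is chosen here for illustration.

```python
import pandas as pd

# Same test data as in the answer
df = pd.DataFrame({
    "Launch Vehicle": ["Soyuz 2.1a", "Ariane 5 ECA", "Falcon 9", "Long March",
                       "Falcon 9", "Atlas 3", "Atlas 3"],
    "Year of Launch": [2016, 2014, 2016, 1997, 2015, 2004, 2004],
})

# Rows are years, columns are vehicles, cells are launch counts (0 where none)
counts = pd.crosstab(df["Year of Launch"], df["Launch Vehicle"])
print(counts)
```

The resulting table can be passed to plot(kind='bar', stacked=True) in the same way as df2.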

Related

How to count values over different thresholds per column in Pandas dataframe groupby?

My goal is for a dataset similar to the example below, to group by [s_num, ip, f_num, direction] and then filter the score columns using separate thresholds and count how many values are above the threshold.
      id  s_num   ip  f_num direction  algo_1_x  algo_2_x  algo_1_score  algo_2_score
0    0.0    0.0  0.0    0.0         X     -4.63     -4.45      0.624356      0.664009
15  19.0    0.0  2.0    0.0         X     -5.44     -5.02      0.411217      0.515843
16  20.0    0.0  2.0    0.0         X    -12.36     -5.09      0.397237      0.541112
20  24.0    0.0  2.0    1.0         X     -4.94     -5.15      0.401744      0.526032
21  25.0    0.0  2.0    1.0         X     -4.78     -4.98      0.386410      0.564934
22  26.0    0.0  2.0    1.0         X     -4.89     -5.03      0.394326      0.513896
24  28.0    0.0  2.0    2.0         X     -4.78     -5.00      0.420078      0.521993
25  29.0    0.0  2.0    2.0         X     -4.91     -5.14      0.407355      0.485878
26  30.0    0.0  2.0    2.0         X     11.83     -4.97      0.392242      0.659122
27  31.0    0.0  2.0    2.0         X     -4.73     -5.07      0.377011      0.524774
the result should look something like:
Each entry in algo_i column is the # of values in the group larger than the corresponding threshold
So far I tried first grouping, and applying custom aggregation like so:
def count_success(x,thresh):
    return ((x > thresh)*1).sum()
thresholds=[0.1,0.2]
df.groupby(attr_cols).agg({f'algo_{i+1}_score':count_success(thresh) for i, thresh in enumerate(thresholds)})
but this results in an error :
count_success() missing 1 required positional argument: 'thresh'
So, how can I pass another argument to a function using .agg( )? or is there an easier way to do it using some pandas function?
Named aggregation does not allow extra parameters to be passed to your function. You can use NumPy broadcasting instead:
import numpy as np
import pandas as pd
attr_cols = ["s_num", "ip", "f_num", "direction"]
score_cols = df.columns[df.columns.str.match(r"algo_\d+_score")]
# Convert everything to numpy to prepare for broadcasting
score = df[score_cols].to_numpy()
threshold = np.array([0.1, 0.5])
# Raise `threshold` up 2 dimensions so that every value in `score` is
# broadcast against every value in `threshold`
mask = score > threshold[:, None, None]
# Assemble the result
row_index = pd.MultiIndex.from_frame(df[attr_cols])
col_index = pd.MultiIndex.from_product([threshold, score_cols], names=["threshold", "algo"])
result = (
    pd.DataFrame(np.hstack(mask), index=row_index, columns=col_index)
    .groupby(attr_cols)
    .sum()
)
Result:
threshold                          0.1                         0.5
algo                      algo_1_score  algo_2_score  algo_1_score  algo_2_score
s_num ip  f_num direction
0.0   0.0 0.0   X                    1             1             1             1
      2.0 0.0   X                    2             2             0             2
          1.0   X                    3             3             0             3
          2.0   X                    4             4             0             3
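If the goal is one threshold per score column, as in the question, another option is to bind the threshold before handing the function to .agg, for example with functools.partial. The sketch below uses a small made-up subset of the sample data, and the threshold values 0.4 and 0.5 are picked only for illustration.

```python
import functools

import pandas as pd

# Small made-up subset of the sample data (illustrative only)
df = pd.DataFrame({
    "s_num": [0.0, 0.0, 0.0, 0.0],
    "ip": [0.0, 2.0, 2.0, 2.0],
    "f_num": [0.0, 0.0, 0.0, 1.0],
    "direction": ["X", "X", "X", "X"],
    "algo_1_score": [0.624356, 0.411217, 0.397237, 0.401744],
    "algo_2_score": [0.664009, 0.515843, 0.541112, 0.526032],
})
attr_cols = ["s_num", "ip", "f_num", "direction"]

# One threshold per score column (values chosen for this example)
thresholds = {"algo_1_score": 0.4, "algo_2_score": 0.5}


def count_success(x, thresh):
    """Count the values in the group that are strictly above thresh."""
    return int((x > thresh).sum())


# partial() fixes `thresh`, so .agg receives a one-argument callable per column
counts = df.groupby(attr_cols).agg(
    {col: functools.partial(count_success, thresh=t) for col, t in thresholds.items()}
)
print(counts)
```

Unlike the broadcasting approach, this compares each column against its own threshold only, so the result has one column per score column rather than one per threshold and column pair.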

Multi-index calculation to new columns

I have a dataframe like this.
status                new                        allocation
asset                 csh        fi        eq         csh   fi   eq
person act_type
p1     inv            0.0       0.0  100000.0         0.0  0.0  1.0
       rsp            0.0   30000.0   20000.0         0.0  0.6  0.4
       tfsa       10000.0   40000.0       0.0         0.2  0.8  0.0
The right three columns are percent of total for each act_type. The following does calculate the columns correctly:
# set the percent allocations
df.loc[idx[:,:],idx["allocation",'csh']] = df.loc[idx[:,:],idx["new",'csh']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'fi']] = df.loc[idx[:,:],idx["new",'fi']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
df.loc[idx[:,:],idx["allocation",'eq']] = df.loc[idx[:,:],idx["new",'eq']] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
I have tried to do these calculations on one line combining 'csh', 'fi', 'eq' as follows:
df.loc[idx[:,:],idx["new", ('csh', 'fi', 'eq')]] / df.loc[idx[:,:],idx["new",:]].sum(axis=1)
But this results in ValueError: cannot join with no level specified and no overlapping names
Any suggestions on how I can reduce these three lines to one line of code, so that I'm dividing ('csh','fi','eq') by the account total and getting the percentages in the next columns?
First, idx[:,:] can be simplified to :. Then use DataFrame.div with axis=0 to divide each row by its total, rename the top column level, and add the new columns with DataFrame.join:
# idx is assumed to be pd.IndexSlice, as in the question
df1 = df.loc[:, idx["new", ('csh', 'fi', 'eq')]].div(df.loc[:, idx["new", :]].sum(axis=1), axis=0)
df = df.join(df1.rename(columns={'new': 'allocation'}, level=0))
print (df)
status                new                        allocation
asset                 csh        fi        eq         csh   fi   eq
person act_type
p1     inv            0.0       0.0  100000.0         0.0  0.0  1.0
       rsp            0.0   30000.0   20000.0         0.0  0.6  0.4
       tfsa       10000.0   40000.0       0.0         0.2  0.8  0.0
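For readers who want to try the row-wise division on a self-contained frame, here is a minimal sketch. It assumes the frame starts with only the "new" columns, and it uses pd.concat with a rebuilt column MultiIndex in place of rename and join; the names pct, allocation and result are introduced here for illustration.

```python
import pandas as pd

# Rebuild the "new" part of the frame from the question
index = pd.MultiIndex.from_tuples(
    [("p1", "inv"), ("p1", "rsp"), ("p1", "tfsa")], names=["person", "act_type"]
)
columns = pd.MultiIndex.from_product(
    [["new"], ["csh", "fi", "eq"]], names=["status", "asset"]
)
df = pd.DataFrame(
    [[0.0, 0.0, 100000.0], [0.0, 30000.0, 20000.0], [10000.0, 40000.0, 0.0]],
    index=index,
    columns=columns,
)

new = df["new"]  # sub-frame with plain columns csh, fi, eq
# axis=0 aligns the row totals with the row index, so each row is divided by its own sum
pct = new.div(new.sum(axis=1), axis=0)

# Put the percentages under a top-level "allocation" label and append them
allocation = pct.copy()
allocation.columns = pd.MultiIndex.from_product(
    [["allocation"], pct.columns], names=df.columns.names
)
result = pd.concat([df, allocation], axis=1)
print(result)
```

Without axis=0, pandas tries to align the totals with the column labels, which is the source of the join error in the question.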

Using groupby().sum() on a dataframe, then plotting a pie chart with labels?

This is my first question here, and I'm quite new to Python/pandas/matplotlib.
I have this line of code that creates a DataFrame:
repartition = sorted2017.groupby(by=sorted2017["Traitement"]).sum()
It works as I expected, except that the column title "Traitement" seems to appear on its own row:
                  Prix  Coût net  Manuvie     CCQ  SSQ
Traitement
masso (Véro)    213.86       0.0    144.0   69.86  0.0
ostéo (Véro)     80.00       0.0     64.0   16.00  0.0
physio (Danny)  415.00       0.0    265.0  150.00  0.0
physio (Véro)   269.00       0.0    204.8   64.20  0.0
psy (Simone)    500.00       0.0    150.0  350.00  0.0
psy (Véro)      300.00       0.0    240.0   60.00  0.0
I wanted to use the "Traitement" column as labels for my matplotlib pie chart, so I tried :
plt.pie(repartition["Prix"], labels=repartition["Traitement"])
plt.show()
But I get a KeyError. I've also tried with iloc for the labels, but then I get
ValueError : "'label' must be of length 'x'"
How can I fix this?
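After groupby, "Traitement" is no longer a regular column: it becomes the index of the result, which is why repartition["Traitement"] raises a KeyError. Use the index for the labels: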
plt.pie(x=repartition["Prix"], labels=repartition.index)
plt.show()
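The effect of groupby on the grouping column can be checked without plotting. The sketch below uses a few made-up rows standing in for sorted2017 and shows two ways to get the labels: from the index, or by passing as_index=False so that "Traitement" stays a regular column.

```python
import pandas as pd

# Made-up rows standing in for sorted2017 (illustrative only)
sorted2017 = pd.DataFrame({
    "Traitement": ["masso (Véro)", "psy (Véro)", "masso (Véro)"],
    "Prix": [100.0, 300.0, 113.86],
})

# Default behaviour: the group key becomes the index of the result
repartition = sorted2017.groupby("Traitement").sum()
labels_from_index = list(repartition.index)

# as_index=False keeps "Traitement" as a regular column instead
flat = sorted2017.groupby("Traitement", as_index=False).sum()
labels_from_column = list(flat["Traitement"])

print(labels_from_index)
print(labels_from_column)
```

Either list can be passed as the labels argument of plt.pie, since both have one entry per row of the grouped result.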

One hot encoding error python machine learning

I am working with categorical variables in machine learning. Here is a sample of my data:
age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
There are two categorical variables, gender and class. I have used the LabelEncoder technique.
My code:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
df=pd.read_csv('test.csv')
X=df.drop(['label'],1)
y=np.array(df['label'])
data=X.iloc[:,:].values
lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()
print(data.shape)
np.savetxt('data.csv',data,fmt='%s')
The data.csv looks like this:
0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0
I am unable to understand why the columns look like this, i.e. where the values of the 'height' column went. Also, data.shape is (4,8) instead of the (4,7) I expected (gender represented by 2 columns, class by 3, plus the 'age' and 'height' features).
Are you sure that you need LabelEncoder + OneHotEncoder? There is a much simpler method (it does not support more advanced procedures, but so far you seem to be working on the basics):
import pandas as pd
import numpy as np
df=pd.read_csv('test.csv')
# drop the target column; the keyword form also works on recent pandas versions
X=df.drop(columns=['label'])
y=np.array(df['label'])
data = pd.get_dummies(X)
The problem with the current code is that after you have done the first OHE:
onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()
the columns get shifted and column 3 is in fact the original height column instead of the label-encoded class column. So change the second one to use column 4 and you will get what you want.
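To check the expected shape without a test.csv file on disk, the sample rows can be read from a string. This is a sketch of the pd.get_dummies route only; the csv_text variable is introduced here for illustration.

```python
import io

import pandas as pd

# The sample rows from the question, inlined so the example is self-contained
csv_text = """age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns=["label"])
y = df["label"].to_numpy()

# Only the text columns (gender, class) are expanded; age and height stay numeric
data = pd.get_dummies(X)
print(data.columns.tolist())
print(data.shape)
```

The result has 7 columns: age and height unchanged, 2 for gender and 3 for class, which is the shape the question expected.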

Scikit learn categorical features ranking

My data contains a lot of categorical features, for example age, color, size, race, gender and so on.
The problem is that in scikit-learn we cannot declare a feature as a factor the way we can in R, so we have to convert the categorical data into dummy columns. For example,
color  size
green  M
red    L
blue   XL
is converted to
color_blue  color_green  color_red  size_L  size_M  size_XL
       0.0          1.0        0.0     0.0     1.0      0.0
       0.0          0.0        1.0     1.0     0.0      0.0
       1.0          0.0        0.0     0.0     0.0      1.0
However, I would like to rank the original features, such as color or size, rather than color_blue or size_M.
Is there any way to do this? Or can I sum the ranking scores of the dummy columns that belong to the same feature?
(For example, the score for the color column would be the sum of the green, blue and red scores.)
Note that I use ExtraTreesClassifier to calculate the ranking scores.
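The summing idea from the question can be sketched with pandas alone. The scores below are made-up placeholder numbers, not real model output; in practice they would come from the fitted classifier's feature_importances_ attribute, paired with the dummy column names. The sketch also assumes that the category values themselves contain no underscore.

```python
import pandas as pd

# Placeholder importance scores, one per dummy column (made-up numbers)
scores = pd.Series({
    "color_blue": 0.10,
    "color_green": 0.30,
    "color_red": 0.20,
    "size_L": 0.15,
    "size_M": 0.05,
    "size_XL": 0.20,
})

# Map each dummy column back to its original feature:
# keep the text before the last "_" (assumes no "_" inside category values)
original = scores.index.str.rsplit("_", n=1).str[0]

# Sum the scores per original feature and sort from most to least important
ranking = scores.groupby(original).sum().sort_values(ascending=False)
print(ranking)
```

Whether a plain sum is the right way to combine the scores is a modelling decision; this only shows the bookkeeping.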
