Pandas python barplot by subgroup

OK, so I have a DataFrame object that's indexed as follows:
index, rev, metric1 (more metrics.....)
exp1, 92365, 0.018987
exp2, 92365, -0.070901
exp3, 92365, 0.150140
exp1, 87654, 0.003008
exp2, 87654, -0.065196
exp3, 87654, -0.174096
For each of these metrics I want to create individual stacked bar plots comparing them based on their rev.
Here's what I've tried:
df = df[['rev', 'metric1']]
df = df.groupby("rev")
df.plot(kind = 'bar')
This results in two individual bar graphs of the metric. Ideally I would have these two merged and stacked (right now stacked=True does nothing). Any help would be much appreciated.
This would give me my ideal result; however, I don't think reorganizing to fit this is the best way to achieve my goal, as I have many metrics and many revisions.
index, metric1(rev92365), metric1(rev87654)
exp1, 0.018987, 0.003008
exp2, -0.070901, -0.065196
exp3, 0.150140, -0.174096
This is my goal. (made by hand)
http://i.stack.imgur.com/5GRqB.png

Following this matplotlib gallery example:
http://matplotlib.org/examples/api/barchart_demo.html
There, they get multiple groups to plot by calling bar once for each set.
You could access these values in pandas with indexing operations as follows:
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(16.2, 10), dpi=300)
Y = Tire2[Tire2.SL == Tire2.SL.unique()[0]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y))
ax.bar(X, Y, width=.4)
Y = Tire2[Tire2.SL == Tire2.SL.unique()[2]].SA.values[0:13]
X = np.linspace(0, np.size(Y), np.size(Y)) + .5
ax.bar(X, Y, width=.4, color='r')
Working from the inside out:
get all of the unique values of 'SL' in one of the cols (rev in your case)
get a Boolean vector of all rows where 'SL' equals the first (or nth) unique value
index Tire2 by that Boolean vector (this will pull out only those rows where the vector is True)
access the values of SA, or a metric in your case (I took only the [0:13] values because I was testing this on a huge data set)
bar plot those values
If your experiments are consistently in order in the frame (as shown), that's that. Otherwise you might need to do a little sorting to get your Y values in the right order; .sort_values(column_name) should take care of that. In my code, I'd slip it in between ...[0]] and .SA...
In general, this kind of operation can really help you out in wrangling big frames. .between is useful. And you can always add, multiply etc. the Boolean vectors to construct more complex logic.
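As a rough, untested sketch of how this maps onto the frame in the question (working on the original, ungrouped df, and assuming the columns are named rev and metric1 as shown, with the experiments appearing in the same order within each rev group):
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
revs = df.rev.unique()
width = 0.4
for i, r in enumerate(revs):
    Y = df[df.rev == r].metric1.values      # rows belonging to this revision
    X = np.arange(len(Y)) + i * width       # shift each revision's bars sideways
    ax.bar(X, Y, width=width, label='rev ' + str(r))
ax.set_xticks(np.arange(len(Y)) + width / 2)
ax.set_xticklabels(df[df.rev == revs[0]].index)  # exp1, exp2, exp3
ax.legend()
plt.show()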

I'm not sure how to get the plot you want automatically without doing exactly the reorganization you specify at the end. The answer by user3823992 gives you more detailed control of the plots, but if you want them more automatic, here is some temporary reorganization that should work: it uses the indexing similarly, but concatenates back into a DataFrame that will do the plot for you.
import numpy as np
import pandas as pd

exp = ['exp1','exp2','exp3']*2
rev = [1,1,1,2,2,2]
met1 = np.linspace(-0.5,1,6)
met2 = np.linspace(1.0,5.0,6)
met3 = np.linspace(-1,1,6)
df = pd.DataFrame({'rev':rev, 'met1':met1, 'met2':met2, 'met3':met3}, index=exp)

for met in df.columns:
    if met != 'rev':
        # start with the first revision's column for this metric
        merged = df[df['rev'] == df.rev.unique()[0]][met]
        merged.name = merged.name + 'rev' + str(df.rev.unique()[0])
        # concatenate the remaining revisions side by side
        for rev in df.rev.unique()[1:]:
            tmp = df[df['rev'] == rev][met]
            tmp.name = tmp.name + 'rev' + str(rev)
            merged = pd.concat([merged, tmp], axis=1)
        merged.plot(kind='bar')
This should give you three plots, one for each of my fake metrics.
EDIT: Or something like this might also do the trick:
df['exp'] = df.index
pt = pd.pivot_table(df, values='met1', index=['exp'], columns=['rev'])
pt.plot(kind='bar')

Related

Best method for non-regular index-based interpolation on grouped dataframes

Problem statement
I had the following problem:
I have samples that ran through independent tests. In my DataFrame, tests of a sample with the same "test name" are also independent, so the pair (test, sample) is independent and unique.
Data are collected at non-regular sampling rates, so we're speaking about unequally spaced indices. This "time series" index is called nonreg_idx in the example. For the sake of simplicity, it is a float between 0 and 1.
I want to figure out the value at a specific index, e.g. nonreg_idx=0.5. If the value is missing, I just want a linear interpolation that depends on the index. If extrapolation would be needed because the value lies beyond the extrema of the sorted nonreg_idx of the group (test, sample), it can stay NaN.
Note the following from pandas documentation:
Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.
'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.
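To make the implication concrete, here is a minimal, hypothetical illustration (not from the question) of why method='linear' is not enough for unequally spaced indices:
import numpy as np
import pandas as pd

# toy series with an unequally spaced float index and one missing value
s = pd.Series([0.0, np.nan, 10.0], index=[0.0, 0.9, 1.0])
s.interpolate(method='linear')  # fills 5.0: values treated as equally spaced
s.interpolate(method='index')   # fills 9.0: weighted by the actual index spacing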
The only solution I found is long, complex and slow. I am wondering if I am missing something, or if, on the contrary, something is missing from the pandas library. I believe having independent tests on various samples with non-regular indices is a typical issue in scientific and engineering fields.
What I tried
sample data set preparation
This part is just for making an example
import pandas as pd
import numpy as np
tests = (f'T{i}' for i in range(20))
samples = (chr(i) for i in range(97,120))
idx = pd.MultiIndex.from_product((tests,samples),names=('tests','samples'))
idx
dfs = list()
for ids in idx:
    group_idx = pd.MultiIndex.from_product(((ids[0],), (ids[1],), tuple(np.random.random_sample(size=(90,))))).sort_values()
    dfs.append(pd.DataFrame(1000*np.random.random_sample(size=(90,)), index=group_idx))
df = pd.concat(dfs)
df = df.rename_axis(index=('test','sample','nonreg_idx')).rename({0:'value'},axis=1)
The (bad) solution
add_missing = df.index.droplevel('nonreg_idx').unique().to_frame().reset_index(drop=True)
add_missing['nonreg_idx'] = .5
add_missing = pd.MultiIndex.from_frame(add_missing)
added_missing = df.reindex(add_missing)
df_full = pd.concat([added_missing.loc[~added_missing.index.isin(df.index)], df])
df_full.sort_index(inplace=True)
def interp_fnc(group):
    try:
        return (group
                .reset_index(['test', 'sample'])
                .interpolate(method='slinear')
                .set_index(['test', 'sample'], append=True)
                .reorder_levels(['test', 'sample', 'nonreg_idx'])
                .sort_index())
    except:
        return group

grouped = df_full.groupby(level=['test', 'sample'])
df_filled = grouped.apply(interp_fnc)
Here, the wanted values are in df_filled. So I can do df_filled.loc[(slice(None), slice(None), .5),'value'] to get what I need for each sample/test.
I would have expected to be able to do the same within one or at most two lines of code; I have 14 here. apply is quite a slow method, and I can't even use numba.
Question
Can someone propose a better solution?
If you think there is no better alternative, please comment and I'll open an issue...

How can I create a stacked barchart with timedeltas using matplotlib?

I'm just getting into data visualization with pandas. At the moment I'm trying to visualize a DataFrame with matplotlib that looks like this:
Initiative_160608 Initiative_160570 Initiative_160056
Beschluss_BR 2009-05-15 2009-05-15 2006-04-07
Vorlage_BT 2009-05-22 2009-05-22 2006-04-26
Beratung_BT 2009-05-28 2009-05-28 2006-05-11
ABeschluss_BT 2009-06-17 2009-06-17 2006-05-17
Beschlussempf 2009-06-17 2009-06-17 2006-05-26
As you can see, I have a number of columns with five different dates (every date symbolizes one event in a total chain of five events). Now to the problem:
My plan is to visualize the data shown above with a stacked horizontal bar chart, using the timedeltas between the five events (how many days have passed between the first and last event, including the dates in between). Every column should represent one bar in the chart. The whole chart is not about the absolute time that has passed, but about the duration of the five events in relation to the overall duration of one column, which means that all bars should have the same overall length.
Yet I haven't found anything similar or come up with a solution myself. I would be extremely thankful for any kind of solution to proceed with the shown data.
I'm not exactly sure if this is what you are looking for, but if each column is supposed to be a bar, and you want the time deltas within each column, then you need the difference in days between each row, and I am guessing the first row should have a difference of 0 days (since it is the starting point).
Also for stacked barplots, the index is used to create the categories, but in your case, you want the columns as categories, and each bar to be composed of the different index values. This means you need to transpose your df eventually.
This solution is pretty ugly, but hopefully it helps.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Initiative_160608": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160570": ['2009-05-15', '2009-05-22', '2009-05-28', '2009-06-17', '2009-06-17'],
    "Initiative_160056": ['2006-04-07', '2006-04-26', '2006-05-11', '2006-05-17', '2006-05-26']})
df.index = ['Beschluss_BR', 'Vorlage_BT', 'Beratung_BT', 'ABeschluss_BT', 'Beschlussempf']
# convert everything to dates
df = df.apply(lambda x: pd.to_datetime(x, format="%Y-%m-%d"))

def get_days(x):
    diff_list = []
    for i in range(len(x)):
        if i == 0:
            diff_list.append(x.iloc[i] - x.iloc[i])
        else:
            diff_list.append(x.iloc[i] - x.iloc[i-1])
    return diff_list
# get the difference in days, then convert back to numbers
df_diff = df.apply(lambda x: get_days(x), axis = 0)
df_diff = df_diff.apply(lambda x: x.dt.days)
# transpose the matrix so that each initiative becomes a stacked bar
df_diff = df_diff.transpose()
# replace 0 values with 0.2 so that the bars are visible
df_diff = df_diff.replace(0, 0.2)
df_diff.plot.bar(stacked = True)
plt.show()
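The question also asks for all bars to have the same overall length (durations relative to each initiative's total). As a small, untested follow-up sketch on top of df_diff above, you could divide each row by its row total before plotting:
# express each segment as a fraction of that initiative's total duration
df_rel = df_diff.div(df_diff.sum(axis=1), axis=0)
# .plot.barh(stacked=True) would give the horizontal variant mentioned in the question
df_rel.plot.bar(stacked=True)
plt.show()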

Alternative method for two way interpolation

I wrote some code to perform interpolation based on two criteria, the amount of insurance and the deductible amount %. I was struggling to do the interpolation all at once, so I split the filtering. The table hf contains the known data on which I am basing my interpolation results; table df contains the new data that needs the developed factors interpolated based on hf.
Right now my workaround is to first filter each table based on the Ded_amount percentage, then perform the interpolation into an empty DataFrame and append after each loop.
I feel like this is inefficient and that there is a better way to do this; I'm looking to hear some feedback on improvements I can make. Thanks.
Test data provided below.
import pandas as pd
from scipy import interpolate
known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
deduct_fact=pd.DataFrame()
for deduct in hf['Ded_amount'].unique():
    deduct_table=hf[hf['Ded_amount']==deduct]
    aoi_table=df[df['Ded_amount']==deduct]
    x=deduct_table['AOI']
    y=deduct_table['factor']
    f=interpolate.interp1d(x,y,fill_value="extrapolate")
    xnew=aoi_table[['AOI']]
    ynew=f(xnew)
    append_frame=aoi_table
    append_frame['Factor']=ynew
    deduct_fact=deduct_fact.append(append_frame)
Yep, there is a way to do this more efficiently, without having to make a bunch of intermediate dataframes and appending them. Have a look at this code:
import pandas as pd
from scipy import interpolate

known_data={'AOI':[80000,100000,150000,200000,300000,80000,100000,150000,200000,300000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%'],'factor':[0.797,0.774,0.739,0.733,0.719,0.745,0.737,0.715,0.711,0.709]}
new_data={'AOI':[85000,120000,130000,250000,310000,85000,120000,130000,250000,310000],'Ded_amount':['2%','2%','2%','2%','2%','3%','3%','3%','3%','3%']}
hf=pd.DataFrame(known_data)
df=pd.DataFrame(new_data)
# Create this column now
df['Factor'] = None
# I like specifying this explicitly; easier to debug
deduction_amounts = list(hf.Ded_amount.unique())
for deduction_amount in deduction_amounts:
    # You can index a dataframe and call a column in one line
    x, y = hf[hf['Ded_amount']==deduction_amount]['AOI'], hf[hf['Ded_amount']==deduction_amount]['factor']
    f = interpolate.interp1d(x, y, fill_value="extrapolate")
    # This is the most important bit. Lambda function on the dataframe
    df['Factor'] = df.apply(lambda x: f(x['AOI']) if x['Ded_amount']==deduction_amount else x['Factor'], axis=1)
The way the lambda function works is:
It goes row by row through the column 'Factor' and gives it a value based on conditions on the other columns.
It returns the interpolation of the AOI column of df (this is what you called xnew) if the deduction amount matches, otherwise it just returns the same thing back.
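If the row-wise apply ever becomes a bottleneck, a hedged alternative sketch (not part of the answer above) is to do one vectorized np.interp call per Ded_amount group. Note that np.interp clips at the end points rather than extrapolating, unlike interp1d(..., fill_value="extrapolate"):
import numpy as np

def interp_group(g, known):
    # known factors for this deduction level, sorted by AOI as np.interp requires
    k = known[known['Ded_amount'] == g.name].sort_values('AOI')
    return pd.Series(np.interp(g['AOI'], k['AOI'], k['factor']), index=g.index)

df['Factor'] = df.groupby('Ded_amount', group_keys=False).apply(interp_group, known=hf)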

Interpolating from a pandas DataFrame or Series to a new DatetimeIndex

Let's say I have an hourly series in pandas; it's fine to assume the source is regular, but it is gappy. If I want to interpolate it to 15 min, the pandas API provides resample('15min').interpolate('cubic'). It interpolates to the new times and provides some control over the limits of interpolation. The spline helps to refine the series as well as fill small gaps. To be concrete:
import numpy as np
import pandas as pd

tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan # too wide a gap
signal[160:168:2] = np.nan # these can be interpolated
df = pd.DataFrame({"signal":signal},index=tndx)
df1= df.resample('15min').interpolate('cubic',limit=9)
Now let's say I have an irregular datetime index. In the example below, the first time is a regular time point, the second is in the big gap and the last is in the interspersed brief gaps.
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
How do I interpolate to from the original series (hourly) to this irregular series of times?
Is the only option to build a series that includes the original data and the destination data? How would I do this? What is the most economical way to achieve the goals of interpolating to an independent irregular index and imposing a gap limit?
In case of irregular timestamps, first set the datetime as the index, and then you can use the interpolate method against the index, e.g. df1 = df.resample('15min').interpolate('index').
You can find more information here: https://pandas.pydata.org/pandas-docs/version/0.16.2/generated/pandas.DataFrame.interpolate.html
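A hedged sketch of how that idea can hit the irregular target times directly (assuming df and tndx2 as defined in the question, and without enforcing any gap limit, which the longer answer below addresses): reindex to the union of the source and destination times, interpolate against the index, then select the destination rows.
# union of source and target times, interpolate by index value, pick out the targets
interpolated = (df.reindex(df.index.union(tndx2))
                  .interpolate(method='index')
                  .loc[tndx2])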
This is an example solution within the pandas interpolate API, which doesn't seem to have a way of using the abscissa and values from the source series to interpolate to new times provided by a destination index as a separate data structure. This method solves that by tacking the destination onto the source. It makes use of the limit argument of df.interpolate and can use any interpolation algorithm from that API, but it isn't perfect, because the limit is in terms of the number of values and, if there are a lot of destination points in a patch of NaNs, those get counted as well.
tndx = pd.date_range(start="2019-01-01",end="2019-01-10",freq="H")
tnum = np.arange(0.,len(tndx))
signal = np.cos(tnum*2.*np.pi/24.)
signal[80:85] = np.nan
signal[160:168:2] = np.nan
df = pd.DataFrame({"signal":signal},index=tndx)
# Express the destination times as a dataframe and append to the source
tndx2 = pd.DatetimeIndex(['2019-01-04 00:00','2019-01-04 10:17','2019-01-07 16:00'])
df2 = pd.DataFrame( {"signal": [np.nan,np.nan,np.nan]} , index = tndx2)
big_df = pd.concat([df, df2], sort=True)
# At this point there are duplicates with NaN values at the bottom of the DataFrame
# representing the destination points. If these are surrounded by lots of NaNs in the source frame
# and we want the limit argument to work in the call to interpolate, the frame has to be sorted and duplicates removed.
big_df = big_df.loc[~big_df.index.duplicated(keep='first')].sort_index(axis=0,level=0)
# Extract at destination locations
interpolated = big_df.interpolate(method='cubic',limit=3).loc[tndx2]

Seaborn heat map with gaps

I am using a dataset from Kaggle and trying to do some data analysis on it.
First I computed the price average per group of brand and vehicle type (this is my average code) and after that I built the heat map from this average (heat map code) (heat map figure). However, I noticed that in the dataset some brands do not have all vehicle types; for instance, alfa_romeo does not show the "bus" type. This becomes a problem because this absence appears as a gap in the heat map.
How can I overcome this situation, for example by putting a zero value where there is a gap?
Try adding the argument fill_value=0 to your df.pivot in your heat map code. This should replace NULL values with 0 and prevent gaps from showing up in your heat map.
EDIT: There is an error with my solution, since pandas.DataFrame.pivot does not have an argument for fill_value. A much better alternative is pandas.pivot_table, which is more or less equivalent to pandas.pivot but with more flexibility. See here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
Here's how your line should be rewritten:
import pandas as pd
df2_pivot = pd.pivot_table(data=df2,
                           index='brand',
                           columns='vehicleType',
                           values='avgPrice',
                           fill_value=0)
Alternatively, you can also run:
df2_pivot = df2.pivot(index='brand',
                      columns='vehicleType',
                      values='avgPrice').fillna(0)
