Subtraction of two series from different parts of the dataframe - python

I have the following data frame:
    SID  AID  START    END
71    1    1 -11136 -11122
74    1    1 -11121 -11109
78    1    1 -11034 -11014
79    1    2 -11137 -11152
83    1    2 -11114 -11127
86    1    2 -11032 -11038
88    1    2 -11002 -11121
I want to do a subtraction of the START elements with AID==1 and AID==2, in order, such that the expected result would be:
-11136 - (-11137) =  1
-11121 - (-11114) = -7
-11034 - (-11032) = -2
   NaN - (-11002) = NaN
So I extracted two groups:
values1 = group.loc[group['AID'] == 1]["START"]
values2 = group.loc[group['AID'] == 2]["START"]
with the following result:
71 -11136
74 -11121
78 -11034
Name: START, dtype: int64
79 -11137
83 -11114
86 -11032
88 -11002
Name: START, dtype: int64
and did a simple subtraction:
values1-values2
But I got all NaNs:
71 NaN
74 NaN
78 NaN
79 NaN
83 NaN
86 NaN
I noticed that if I use data from the same AID group (e.g. START-END), I get the right answer. I get the NaN only when I "mix" AID group. I'm just getting started with Pandas, but I'm obviously missing something here. Any suggestion?

Let's try this:
df.set_index([df.groupby(['SID','AID']).cumcount(),'AID'])['START'].unstack().add_prefix('col_').eval('col_1 - col_2')
Output:
0 1.0
1 -7.0
2 -2.0
3 NaN
dtype: float64
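To unpack that chain, here is the same logic step by step on data reconstructed from the question (a sketch; the intermediate names pos and wide are mine):

import pandas as pd

df = pd.DataFrame({'SID': [1, 1, 1, 1, 1, 1, 1],
                   'AID': [1, 1, 1, 2, 2, 2, 2],
                   'START': [-11136, -11121, -11034,
                             -11137, -11114, -11032, -11002]},
                  index=[71, 74, 78, 79, 83, 86, 88])

# Position of each row within its (SID, AID) group: 0, 1, 2, ...
pos = df.groupby(['SID', 'AID']).cumcount()
# Rows with the same position share an index label, so unstacking
# AID puts the two groups side by side, aligned by position.
wide = df.set_index([pos, 'AID'])['START'].unstack().add_prefix('col_')
print(wide.eval('col_1 - col_2'))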

pandas aligns those operations by index label. Since your labels ((71, 74, 78) and (79, 83, 86, 88)) don't match, it cannot find any value to subtract, so every result is NaN. One way to deal with this is to use a numpy array instead of a Series, so there are no labels to align (note this requires both sides to have the same length, so the extra fourth AID == 2 row would need to be trimmed first):
values1 - values2.values
Out:
71 1
74 -7
78 -2
Name: START, dtype: int64
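Another option along the same lines is to reset both indexes so the two Series align by position; this also reproduces the trailing NaN from the expected output when the groups have different lengths. A minimal sketch:

# Align by position instead of by the original row labels.
v1 = values1.reset_index(drop=True)
v2 = values2.reset_index(drop=True)
print(v1 - v2)
# 0    1.0
# 1   -7.0
# 2   -2.0
# 3    NaN
# Name: START, dtype: float64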

Bizarre way to go about it
import numpy as np

-np.diff([g.reset_index(drop=True) for n, g in df.groupby('AID').START])[0]
0 1.0
1 -7.0
2 -2.0
3 NaN
Name: START, dtype: float64


Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up: instead of just the month it shows the last day of the month. Also the station name appears once per group rather than on every row, and the mean value doesn't have a column name at all. This isn't a DataFrame but a pandas.core.series.Series; converting it with .to_frame() doesn't give the layout I'm after either. I don't get this part.
I found that in order to return a normal dataframe, to use
as_index = False
In the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
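For reference, here is that line run end-to-end on a cut-down version of the sample data (values assumed for illustration):

import pandas as pd

df = pd.DataFrame({'Station_Name': [2, 2, 45, 45],
                   'Date': pd.to_datetime(['2006-01-03', '2006-01-04',
                                           '2006-12-23', '2006-12-24']),
                   'Value': [18, 12, 47, 46]})

# Group by station and by calendar month (Period), then average.
out = (df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value']
         .mean()
         .reset_index())
print(out)
#    Station_Name     Date  Value
# 0             2  2006-01   15.0
# 1            45  2006-12   46.5

Because the month is stored as a Period rather than a month-end Timestamp, the Date column matches the 2003-01 style layout asked for in the edit.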

Is there a way to do rolling rank in Pandas?

I am trying to rank some values in one column over a rolling period of N days instead of having the ranking done over the entire set. I have seen several methods here using rolling_apply, but I have read that this function is no longer available in pandas. For example, in the following table:
            A
01-01-2013  100
02-01-2013  85
03-01-2013  110
04-01-2013  60
05-01-2013  20
06-01-2013  40
For the column A above, how can I have the rank as below for N = 3:
            A    Ranked_A
01-01-2013  100  NaN
02-01-2013  85   NaN
03-01-2013  110  1
04-01-2013  60   3
05-01-2013  20   3
06-01-2013  40   2
Yes, there is a workaround: still rolling, but it needs apply (with .iloc[-1] to pull the rank of the last value in each window by position):
df.A.rolling(3).apply(lambda x: pd.Series(x).rank(ascending=False).iloc[-1])
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1.0
04-01-2013 3.0
05-01-2013 3.0
06-01-2013 2.0
Name: A, dtype: float64
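For anyone who wants to reproduce it, a self-contained version with the question's data (dates kept as plain strings for brevity):

import pandas as pd

df = pd.DataFrame({'A': [100, 85, 110, 60, 20, 40]},
                  index=['01-01-2013', '02-01-2013', '03-01-2013',
                         '04-01-2013', '05-01-2013', '06-01-2013'])

# Rank of the newest value within each 3-row window (1 = largest).
ranked = df.A.rolling(3).apply(
    lambda x: pd.Series(x).rank(ascending=False).iloc[-1])
print(ranked)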

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to take the maximum of those values and then divide by the mean in the proper Pythonic way.
I feel like the whole function can be
s = df.filter(like='y')
s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
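And to attach it as the new column from the question (same filter as above):

s = df.filter(like='y')
df['max_rel_dif'] = s.sub(s.mean(1), axis=0).abs().max(1) / s.mean(1)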

Results in columns without decimal places?

I have looked through a lot of posts, but none of the solutions I can implement in my code:
x4 = x4.set_index('grupa').T.rename_axis('DANE').reset_index().rename_axis(None,1).round()
After which I get the results DataFrame:
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5.0 94.0 61.0 623.0
1 marza_netto 7.0 120.0 69.0 668.0
2 marza_procent2 32.0 34.0 29.0 27.0
But I would like to receive:
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
I tried replace('.0', ''), int(round()), and astype(int), but I don't get good results, or I get attribute-incompatibility errors with the DataFrame.
If the only non-numeric column is DANE, cast before converting the index back to a column:
x4 = (x4.set_index('grupa')
        .T
        .rename_axis('DANE')
        .astype(int)
        .reset_index()
        .rename_axis(None, 1))
A more general solution is to select all float columns and cast them:
cols = df.select_dtypes(include=['float']).columns
df[cols] = df[cols].astype(int)
print (df)
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
If there are some NaN values, converting to int is not possible. So you can either:
1. Drop all rows with NaNs:
df = df.dropna()
2. Replace NaNs with some integer like 0:
df = df.fillna(0)
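On a recent pandas version (0.24+), a third option is the nullable integer dtype, which keeps the NaNs while dropping the trailing .0; a sketch (note the capital I in 'Int64'):

cols = df.select_dtypes(include=['float']).columns
df[cols] = df[cols].astype('Int64')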
Not 100% sure I got your question, but you can use an astype(int) conversion.
df = df.set_index('DANE').astype(int).reset_index()
df
DANE BAKALIE NASIONA OWOCE WARZYWA
0 ilosc 5 94 61 623
1 marza_netto 7 120 69 668
2 marza_procent2 32 34 29 27
If you're dealing with rows that have NaNs, either drop those rows and convert, or convert to object dtype. The latter is not recommended because you lose performance.

How to apply group by on data frame with neglecting NaN values in Pandas?

I am sorry if this is too simple, but I have searched a lot and couldn't find a solution for this problem.
I am populating my data frame (df) as below:
weather = pd.read_csv(weather_path)
weather_stn1 = weather[weather['Station'] == 1][['Tavg']]
weather_stn2 = weather[weather['Station'] == 2][['Tavg']]
df = pd.DataFrame(columns=['xAxis', 'yAxis1', 'yAxis2'])
df['xAxis'] = pd.to_datetime(weather['Date'])
df['yAxis1'] = weather_stn1['Tavg']
df['yAxis2'] = weather_stn2['Tavg']
My data frame is as below:
xAxis yAxis1 yAxis2
0 2009-05-01 53 NaN
1 2009-05-01 NaN 55
2 2009-05-02 55 NaN
3 2009-05-02 NaN 55
4 2009-05-03 57 NaN
5 2009-05-03 NaN 58
but I want to have my results as below:
xAxis yAxis1 yAxis2
0 2009-05-01 53 55
2 2009-05-02 55 55
4 2009-05-03 57 58
I have been working on reindexing of weather_stn1 and weather_stn2 and in applying group by but it is not working as I want to do. It ends up with me having nothing at all to display!
How should I approach this problem?
Thanks a lot for your time in advance.
Guys, I have found the solution myself; in case anyone else gets stuck, it might be helpful.
df = pd.DataFrame(columns=['xAxis', 'yAxis1', 'yAxis2'])
df['xAxis'] = pd.to_datetime(weather['Date'])
df['yAxis1'] = weather_stn1['Tavg']
df['yAxis2'] = weather_stn2['Tavg']
df = df.groupby(df['xAxis']).mean()
print(df.reset_index())
Now my output is as:
xAxis yAxis1 yAxis2
0 2009-05-01 53 55
1 2009-05-02 55 55
2 2009-05-03 57 58
3 2009-05-04 57 60
4 2009-05-05 60 62
5 2009-05-06 63 66
That simple it was!
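This works because mean skips NaN by default, so within each date group the non-NaN value from each station column survives. A minimal sketch of the mechanism:

import numpy as np
import pandas as pd

df = pd.DataFrame({'xAxis': ['2009-05-01', '2009-05-01'],
                   'yAxis1': [53, np.nan],
                   'yAxis2': [np.nan, 55]})

# NaNs are ignored, so each column's single real value is kept.
print(df.groupby('xAxis').mean())
#             yAxis1  yAxis2
# xAxis
# 2009-05-01    53.0    55.0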
What you really want to do is to pivot the table so that the values in station columns become column headers. Try this:
df = weather.pivot(index='Date', columns='Station', values='Tavg')
If there is no more than one record for each station for each date then you will get what you want, except that the dates will be the index rather than a column.
You can reset the index and change the column names after if you like.
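A sketch of that cleanup, with the column names from the question assumed (Station values 1 and 2 become the new column headers):

plot_df = (weather.pivot(index='Date', columns='Station', values='Tavg')
                  .reset_index()
                  .rename(columns={'Date': 'xAxis', 1: 'yAxis1', 2: 'yAxis2'})
                  .rename_axis(None, axis=1))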
