Shifting column in multiindex dataframe with missing dates - python

I'd like to shift a column in a multiindex dataframe in order to calculate a regression model with a lagged independent variable. As my time series has missing dates, I only want the shifted values where the previous day is actually known. The df looks like this:
                 cost
ID  day
1   31.01.2020   0
1   03.02.2020   0
1   04.02.2020   0.12
1   05.02.2020   0
1   06.02.2020   0
1   07.02.2020   0.08
1   10.02.2020   0
1   11.02.2020   0
1   12.02.2020   0.03
1   13.02.2020   0.1
1   14.02.2020   0
The desired output would look like this:
                 cost   cost_lag
ID  day
1   31.01.2020   0      NaN
1   03.02.2020   0      NaN
1   04.02.2020   0.12   0
1   05.02.2020   0      0.12
1   06.02.2020   0      0
1   07.02.2020   0.08   0
1   10.02.2020   0      NaN
1   11.02.2020   0      0
1   12.02.2020   0.03   0
1   13.02.2020   0.1    0.03
1   14.02.2020   0      0.1
Based on this answer to a similar question I've tried the following:
df['cost_lag'] = df.groupby(['id'])['cost'].shift(1)[df.reset_index().day == df.reset_index().day.shift(1) + datetime.timedelta(days=1)]
But that results in an error message I don't understand:
IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match
I've also tried to fill the missing dates following an approach suggested here:
ams_spend_ranking_df = ams_spend_ranking_df.index.get_level_values(1).apply(lambda x: datetime.datetime(x, 1, 1))
again resulting in an error message which does not enlighten me:
AttributeError: 'DatetimeIndex' object has no attribute 'apply'
Long story short: how can I shift the cost column by 1 day and add NaNs if I don't have data on the previous day?

You can add all the missing datetimes with DataFrameGroupBy.resample and Resampler.asfreq:
df1 = df.reset_index(level=0).groupby(['ID'])['cost'].resample('d').asfreq()
print (df1)
ID day
1 2020-01-31 0.00
2020-02-01 NaN
2020-02-02 NaN
2020-02-03 0.00
2020-02-04 0.12
2020-02-05 0.00
2020-02-06 0.00
2020-02-07 0.08
2020-02-08 NaN
2020-02-09 NaN
2020-02-10 0.00
2020-02-11 0.00
2020-02-12 0.03
2020-02-13 0.10
2020-02-14 0.00
Name: cost, dtype: float64
Then, using your idea with DataFrameGroupBy.shift on the filled Series, it works as needed; on assignment the result aligns back to the original index:
df['cost_lag'] = df1.groupby('ID').shift(1)
print (df)
cost cost_lag
ID day
1 2020-01-31 0.00 NaN
2020-02-03 0.00 NaN
2020-02-04 0.12 0.00
2020-02-05 0.00 0.12
2020-02-06 0.00 0.00
2020-02-07 0.08 0.00
2020-02-10 0.00 NaN
2020-02-11 0.00 0.00
2020-02-12 0.03 0.00
2020-02-13 0.10 0.03
2020-02-14 0.00 0.10

Related

Filter dataframe using isna() to filter out rows that have null values in the following columns

I have a dataframe similar to this one:
id name val1_rain val2_tik val3_bon val4_tig ...
0 2349 Rivi 0.11 0.34 0.78 0.21
1 3397 Mani NaN NaN NaN NaN
2 0835 Pigi 0.34 NaN 0.32 NaN
3 5093 Tari 0.65 0.12 0.34 2.45
4 2340 Yoti NaN NaN NaN NaN
I want to drop any row that has null values in all the columns that come after the name column ([:, 2:]).
So the result output would look like this:
id name val1_rain val2_tik val3_bon val4_tig ...
0 2349 Rivi 0.11 0.34 0.78 0.21
2 0835 Pigi 0.34 NaN 0.32 NaN
3 5093 Tari 0.65 0.12 0.34 2.45
I have tried to do something like this:
df[~df.iloc[:,2:].isnull()]
but that raised an error:
ValueError: cannot reindex from a duplicate axis
First of all, I'm not sure why the error speaks about duplicate axis.
Then, I would like to find a way to keep only the rows that have a value in any column after the 2nd column.
I haven't found any question similar to this.
You can keep rows that have at least one non-missing value after the second column with DataFrame.notna and DataFrame.any:
df = df[df.iloc[:,2:].notna().any(axis=1)]
print (df)
id name val1_rain val2_tik val3_bon val4_tig
0 2349 Rivi 0.11 0.34 0.78 0.21
2 835 Pigi 0.34 NaN 0.32 NaN
3 5093 Tari 0.65 0.12 0.34 2.45
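If you prefer not to build the boolean mask yourself, DataFrame.dropna with how='all' restricted to those columns should give the same result; a minimal sketch, assuming the columns after name are df.columns[2:]:

# drop rows whose values are NaN in *all* columns after the second one
df = df.dropna(subset=list(df.columns[2:]), how='all')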

Pandas function to subtract a cumulative column monthly

I have a weather dataset; it gives me many years of data, as below:
Date         rainfallInMon
2009-01-01   0.0
2009-01-02   0.03
2009-01-03   0.05
2009-01-04   0.05
2009-01-05   0.06
...          ...
2009-01-29   0.2
2009-01-30   0.21
2009-01-31   0.21
2009-02-01   0.0
2009-02-02   0.0
...          ...
I am trying to get the daily rainfall by subtracting the previous day's value within each month, since the column is cumulative and resets at the start of every month. For example:
Date         rainfallDaily
2009-01-01   0.0
2009-01-02   0.03
2009-01-03   0.02
...          ...
2009-01-29   0.01
2009-01-30   0.0
...          ...
Thanks for your efforts in advance.
Because there are many years of data, use Series.dt.to_period to create month periods that keep months from different years separate:
df['rainfallDaily'] = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
                         .diff()
                         .fillna(0))
Or use Grouper:
df['rainfallDaily'] = (df.groupby(pd.Grouper(freq='M', key='Date'))['rainfallInMon']
                         .diff()
                         .fillna(0))
print (df)
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
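A minimal sketch (sample data assumed from the question) illustrating why the per-month grouping matters: a plain diff() produces -0.21 at the month boundary, while the month-period grouping resets to 0 there:

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2009-01-30', '2009-01-31', '2009-02-01', '2009-02-02']),
    'rainfallInMon': [0.21, 0.21, 0.0, 0.0],
})

plain = df['rainfallInMon'].diff().fillna(0)          # -0.21 at 2009-02-01
monthly = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
             .diff()
             .fillna(0))                              # 0.0 at 2009-02-01
print(pd.concat([df, plain.rename('plain_diff'),
                 monthly.rename('rainfallDaily')], axis=1))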
Try:
# Convert to datetime if it's not already the case
df['Date'] = pd.to_datetime(df['Date'])
df['rainfallDaily'] = df.resample('M', on='Date')['rainfallInMon'].diff().fillna(0)
print(df)
# Output
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00

Is there a way to plot corresponding points of two data frames?

I have two dataframes with the same columns and date indices:
df1:
Date T.TO AS.TO NTR.TO ... R.TO
2016-03-03 0.1 0.02 0.04 0.02
2016-03-04 0.09 0.01 0.02 0.02
2016-03-05 0.1 0.02 0.04 0.02
...
2019-03-03 0.09 0.01 0.02 0.02
df2:
Date T.TO AS.TO NTR.TO ... R.TO
2016-03-03 0.01 0.32 0.04 0.02
2016-03-04 0.81 0.21 0.02 0.02
2016-03-05 0.01 0.12 0.04 0.02
...
2019-03-03 0.89 0.11 0.12 0.72
I want to plot all the matching points of the two dataframes on a chart: for example, the first point would correspond to 2016-03-03, T.TO (0.1, 0.01), another to 2016-03-03, AS.TO (0.02, 0.32), and so on, giving a large number of points. I will then use these to find a line of best fit.
I know how to find the best fit line, but I am having difficulty plotting these points directly. I tried using nested for loops and dictionaries, but I was wondering if there is a more straightforward approach?
To plot these points, you can stack:
plt.scatter(df1.set_index('Date').stack(), df2.set_index('Date').stack())
Output: a scatter plot of all the paired points (figure not shown).
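A slightly expanded sketch of the same idea, assuming df1 and df2 are as shown in the question (with Date as a regular column), that also fits the line the OP mentions:

import numpy as np
import matplotlib.pyplot as plt

# pair up every (date, ticker) cell of the two frames and keep only positions
# present in both
x = df1.set_index('Date').stack()
y = df2.set_index('Date').stack()
x, y = x.align(y, join='inner')

plt.scatter(x, y)

# line of best fit through the paired points
slope, intercept = np.polyfit(x, y, 1)
xs = np.linspace(x.min(), x.max(), 100)
plt.plot(xs, slope * xs + intercept)
plt.show()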
If you want to drop out all the data that is not common between the two dataframes then this should work.
In [71]: df = pd.read_clipboard()
In [72]: df
Out[72]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 0.10 0.02 0.04 0.02 NaN
1 2016-03-04 0.09 0.01 0.02 0.02 NaN
2 2016-03-05 0.10 0.02 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 0.09 0.01 0.02 0.02 NaN
In [73]: df2 = pd.read_clipboard()
In [74]: df2
Out[74]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 0.01 0.32 0.04 0.02 NaN
1 2016-03-04 0.81 0.21 0.02 0.02 NaN
2 2016-03-05 0.01 0.12 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 0.89 0.11 0.12 0.72 NaN
Then df3 only contains the values that match between the two datasets:
In [75]: df3 = df[df==df2]
In [76]: df3
Out[76]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 NaN NaN 0.04 0.02 NaN
1 2016-03-04 NaN NaN 0.02 0.02 NaN
2 2016-03-05 NaN NaN 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 NaN NaN NaN NaN NaN
From there plotting is a simple matter.

Convert upper triangular matrix to lower triangular matrix in Pandas Dataframe

I tried using transpose and adding some twists to it, but it didn't work out.
Convert upper:
Data:
      0     1     2     3
0     5   NaN   NaN   NaN
1     1   NaN   NaN   NaN
2  0.21  0.31  0.41  0.51
3  0.32  0.42  0.52   NaN
4  0.43  0.53   NaN   NaN
5  0.54   NaN   NaN   NaN
to:
Data:
      0     1     2     3
0     5   NaN   NaN   NaN
1     1   NaN   NaN   NaN
2  0.21   NaN   NaN   NaN
3  0.31  0.32   NaN   NaN
4  0.41  0.42  0.43   NaN
5  0.51  0.52  0.53  0.54
without affecting the first two rows.
I believe you need justify with sort, excluding the first 2 rows:
import numpy as np

# justify is not a pandas/NumPy built-in; it is the array-justify helper
# commonly used in NumPy answers (a sketch is included at the end of this section)
arr = justify(df.values[2:, :], invalid_val=np.nan, side='down', axis=0)
df.values[2:, :] = np.sort(arr, axis=1)
print (df)
0 1 2 3
0 5.00 NaN NaN NaN
1 1.00 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
IIUC you can first index the dataframe from row 2 onwards and swap with the transpose, and then you can use justify so that all NaNs are at the top:
df.iloc[2:, :] = df.iloc[2:, :].T.values
pd.DataFrame(justify(df.values.astype(float), invalid_val=np.nan, side='down', axis=0))
0 1 2 3
0 5 NaN NaN NaN
1 1 NaN NaN NaN
2 0.21 NaN NaN NaN
3 0.31 0.32 NaN NaN
4 0.41 0.42 0.43 NaN
5 0.51 0.52 0.53 0.54
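Both answers rely on a justify helper that is not part of pandas or NumPy. A minimal sketch of such a helper, following the widely used NumPy justify recipe (shown only for reference; intended here for float arrays with NaN as the invalid value, as in this question):

import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # push all valid (non-invalid) values of the 2D array `a` toward one side
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)            # invalid first, valid last
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out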

Fill zero values for combinations of unique multi-index values after groupby

To explain my problem better, let's pretend I have a shop with 3 unique customers, and my dataframe contains every purchase of my customers with weekday, name and paid price.
name price weekday
0 Paul 18.44 0
1 Micky 0.70 0
2 Sarah 0.59 0
3 Sarah 0.27 1
4 Paul 3.45 2
5 Sarah 14.03 2
6 Paul 17.21 3
7 Micky 5.35 3
8 Sarah 0.49 4
9 Micky 17.00 4
10 Paul 2.62 4
11 Micky 17.61 5
12 Micky 10.63 6
The information I would like to get is the average price per unique customer per weekday. What I often do in similar situations is to group by several columns with sum and then take the average of a subset of the columns.
df = df.groupby(['name','weekday']).sum()
price
name weekday
Micky 0 0.70
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
2 3.45
3 17.21
4 2.62
Sarah 0 0.59
1 0.27
2 14.03
4 0.49
df = df.groupby(['weekday']).mean()
price
weekday
0 6.576667
1 0.270000
2 8.740000
3 11.280000
4 6.703333
5 17.610000
6 10.630000
Of course, this only works if all my unique customers have at least one purchase per day.
Is there an elegant way to get a zero value for all combinations between unique index values that have no sum after the first groupby?
My solution so far has been either to reindex on a MultiIndex created from the unique values of the grouped columns, or to use the unstack-fillna-stack combination, but neither really satisfies me.
Appreciate your help!
IIUC, let's use unstack and fillna then stack:
df_out = df.groupby(['name','weekday']).sum().unstack().fillna(0).stack()
Output:
price
name weekday
Micky 0 0.70
1 0.00
2 0.00
3 5.35
4 17.00
5 17.61
6 10.63
Paul 0 18.44
1 0.00
2 3.45
3 17.21
4 2.62
5 0.00
6 0.00
Sarah 0 0.59
1 0.27
2 14.03
3 0.00
4 0.49
5 0.00
6 0.00
And,
df_out.groupby('weekday').mean()
Output:
price
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
I think you can use pivot_table to do all the steps at once. I'm not exactly sure what you want but the default aggregation from pivot_table is the mean. You can change it to 'sum'.
df1 = df.pivot_table(index='name', columns='weekday', values='price',
                     fill_value=0, aggfunc='sum')
weekday 0 1 2 3 4 5 6
name
Micky 0.70 0.00 0.00 5.35 17.00 17.61 10.63
Paul 18.44 0.00 3.45 17.21 2.62 0.00 0.00
Sarah 0.59 0.27 14.03 0.00 0.49 0.00 0.00
And then take the mean of each column.
df1.mean()
weekday
0 6.576667
1 0.090000
2 5.826667
3 7.520000
4 6.703333
5 5.870000
6 3.543333
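The two steps can also be chained into a single expression; a minimal sketch, assuming df as in the question:

# sum per (name, weekday), fill missing combinations with 0, then average per weekday
avg_price_per_weekday = (df.pivot_table(index='name', columns='weekday', values='price',
                                        fill_value=0, aggfunc='sum')
                           .mean())
print(avg_price_per_weekday)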
