Fill multiple rows in between pandas dataframe rows on condition - python

I have a dataset like below:
pd.DataFrame({'Date':['2019-01-01','2019-01-03','2019-01-01','2019-01-04','2019-01-01','2019-01-03'],'Name':['A','A','B','B','C','C'],'Open Price':[100,200,300,400,500,600],'Close Price':[200,300,400,500,600,700]})
We can see that a few day entries are missing from this table: 2019-01-02 for A, 2019-01-02 and 2019-01-03 for B, and 2019-01-02 for C.
What I'm looking to do is add dummy rows to the dataframe for these missing dates, with the Close Price of each dummy row taken from the Open Price of the next available entry. I don't care about the Open Price of the dummy rows; it can be either NaN or 0.
Expected output
pd.DataFrame({'Date':['2019-01-01','2019-01-02','2019-01-03','2019-01-01','2019-01-02','2019-01-03','2019-01-04','2019-01-01','2019-01-02','2019-01-03'],'Name':['A','A','A','B','B','B','B','C','C','C'],'Open Price':[50,'nan',150,250,'nan','nan',350,450,'nan',550],'Close Price':[200,150,300,400,350,350,500,600,550,700]})
Any help would be appreciated!

Your logic for how the prices should be interpolated is fuzzy, but to get you started, consider the following, remembering to convert Date to a datetime dtype first:
df['Date'] = pd.to_datetime(df['Date'])
df = (df.groupby('Name')
        .resample('D', on='Date')   # daily frequency within each Name
        .mean()                     # the inserted days come out as NaN
        .swaplevel()
        .interpolate())             # linear interpolation fills the NaNs
print(df)
Open Price Close Price
Date Name
2019-01-01 A 100.000000 200.000000
2019-01-02 A 150.000000 250.000000
2019-01-03 A 200.000000 300.000000
2019-01-01 B 300.000000 400.000000
2019-01-02 B 333.333333 433.333333
2019-01-03 B 366.666667 466.666667
2019-01-04 B 400.000000 500.000000
2019-01-01 C 500.000000 600.000000
2019-01-02 C 550.000000 650.000000
2019-01-03 C 600.000000 700.000000
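The answer above interpolates both price columns. If you instead want to follow the asker's stated rule literally (leave Open Price empty on the inserted rows and take Close Price from the next available entry's Open Price), a minimal sketch along those lines, assuming a reasonably recent pandas and reusing the question's frame (daily is just an illustrative name), could be:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])

# insert the missing calendar days per Name; the new rows start out as all NaN
daily = (df.set_index('Date')
           .groupby('Name')[['Open Price', 'Close Price']]
           .apply(lambda g: g.asfreq('D'))
           .reset_index())

# for the inserted rows only, copy Close Price from the next real entry's Open Price
daily['Close Price'] = daily['Close Price'].fillna(
    daily.groupby('Name')['Open Price'].bfill())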

Related

Trouble with using groupby and ffill

I have pandas Dataframe:
Date1 Date2 Date3 Date4 id
2019-01-01 2019-01-02 NaT 2019-01-03 111
NaT NaT 2019-01-02 NaT 111
2019-02-04 NaT 2019-02-05 2019-02-06 222
NaT 2019-02-08 NaT NaT 222
I expect:
Date1 Date2 Date3 Date4 id
2019-01-01 2019-01-02 2019-01-02 2019-01-03 111
2019-02-04 2019-02-08 2019-02-05 2019-02-06 222
I tried to use:
df = df.groupby(['id']).fillna(method='ffill')
But the process ran for a very long time without finishing.
Thanks for any suggestions.
The logic you want is first, which takes the first non-null value within each group. Assuming those NaT values indicate proper datetime columns:
df.groupby('id', as_index=False).agg('first')
# id Date1 Date2 Date3 Date4
#0 111 2019-01-01 2019-01-02 2019-01-02 2019-01-03
#1 222 2019-02-04 2019-02-08 2019-02-05 2019-02-06
ffill is the wrong tool here because it returns a DataFrame indexed exactly like the original; you want an aggregation that collapses to one row per groupby key. Also, ffill only fills forwards, whereas sometimes the value you want occurs only on the second row of a group.
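To make the shape difference concrete, here is a small hedged sketch built from the question's sample data, contrasting the two operations (same_shape and collapsed are just illustrative names):
import pandas as pd

df = pd.DataFrame({
    'Date1': pd.to_datetime(['2019-01-01', None, '2019-02-04', None]),
    'Date2': pd.to_datetime(['2019-01-02', None, None, '2019-02-08']),
    'Date3': pd.to_datetime([None, '2019-01-02', '2019-02-05', None]),
    'Date4': pd.to_datetime(['2019-01-03', None, '2019-02-06', None]),
    'id': [111, 111, 222, 222]})

# groupby + ffill keeps all four rows and only fills downwards within each id,
# so a value that first appears on a later row never reaches the earlier rows
same_shape = df.groupby('id').ffill()

# groupby + first collapses to one row per id, taking the first non-null per column
collapsed = df.groupby('id', as_index=False).agg('first')
print(collapsed)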

How can I get different statistics for a rolling datetime range up to a current value in a pandas dataframe?

I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get rolling statistics for column_b over a window of time in datetime_column covering the last N months. The window, however, only includes rows with the same value in column_a as the current row.
Code example using a for loop which is not feasible given the size:
from datetime import timedelta

mean_dict = {}
for index, test_date in enumerate(df.datetime_column):
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()   # mean over the filtered window
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big (~2M rows), which means that looping is not an option. I already tried looping and creating a subset based on conditional values, but it takes too long.
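No answer is reproduced for this question, but a hedged sketch of one common vectorized approach, using a time-based rolling window with closed='left' so the current row itself is excluded (column names follow the question; rolling_mean_b is just an illustrative name), might look like this:
import pandas as pd

# the time column must be monotonically increasing within each group
df = df.sort_values(['column_a', 'datetime_column'])

roll = (df.groupby('column_a')[['datetime_column', 'column_b']]
          .rolling('180D', on='datetime_column', closed='left')   # window is [t - 180d, t)
          .mean())

# the rolled rows come back in the same order as the sorted frame
df['rolling_mean_b'] = roll['column_b'].to_numpy()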

How to perform a cumulative sum of distinct values in pandas dataframe

I have a dataframe like this:
id date company ......
123 2019-01-01 A
224 2019-01-01 B
345 2019-01-01 B
987 2019-01-03 C
334 2019-01-03 C
908 2019-01-04 C
765 2019-01-04 A
554 2019-01-05 A
482 2019-01-05 D
and I want to get the cumulative number of unique values over time for the 'company' column, so if a company appears again at a later date it is not counted again.
My expected output is:
date cumulative_count
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
I've tried:
df.groupby(['date']).company.nunique().cumsum()
but this double counts if the same company appears on a different date.
Using duplicated + cumsum + last
m = df.duplicated('company')      # True for repeat appearances of a company
d = df['date']
(~m).cumsum().groupby(d).last()   # running distinct count, taking the last value per date
date
2019-01-01 2
2019-01-03 3
2019-01-04 3
2019-01-05 4
dtype: int32
Another way, trying to fix anky_91's approach:
(df.company.map(hash)).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
Out[196]:
date
2019-01-01 2.0
2019-01-03 3.0
2019-01-04 3.0
2019-01-05 4.0
Name: company, dtype: float64
From anky_91
(df.company.astype('category').cat.codes).expanding().apply(lambda x: len(set(x)),raw=True).groupby(df.date).max()
This takes more code than anky's answer, but still works for the sample data:
df = df.sort_values('date')
(df.drop_duplicates(['company'])    # keep only each company's first appearance
   .groupby('date')
   .size().cumsum()                 # running count of first appearances
   .reindex(df['date'].unique())    # restore dates that only had repeat companies
   .ffill()
)
Output:
date
2019-01-01 2.0
2019-01-03 3.0
2019-01-04 3.0
2019-01-05 4.0
dtype: float64
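For what it's worth, a compact hedged variant of the same idea (assuming the frame is already sorted by date, as in the sample):
# count each company only at its first appearance, then accumulate per date
cumulative_count = (~df.duplicated('company')).groupby(df['date']).sum().cumsum()
print(cumulative_count)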

Pandas Rolling mean based on groupby multiple columns

I have a long-format dataframe with repeated values in two columns and data in another column. I want to find SMAs (simple moving averages) for each group. My problem is that rolling() simply ignores the fact that the data is grouped by two columns.
Here is some dummy data and code.
import numpy as np
import pandas as pd
dtix=pd.Series(pd.date_range(start='1/1/2019', periods=4) )
df=pd.DataFrame({'ix1':np.repeat([0,1],4), 'ix2':pd.concat([dtix,dtix]), 'data':np.arange(0,8) })
df
ix1 ix2 data
0 0 2019-01-01 0
1 0 2019-01-02 1
2 0 2019-01-03 2
3 0 2019-01-04 3
0 1 2019-01-01 4
1 1 2019-01-02 5
2 1 2019-01-03 6
3 1 2019-01-04 7
Now when I perform a grouped rolling mean on this data, I am getting an output like this:
df.groupby(['ix1','ix2']).agg({'data':'mean'}).rolling(2).mean()
data
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 3.5
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
Desired Output:
Whereas, what I would actually like to have is this:
sma
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 NaN
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
Will appreciate your help with this.
Use another groupby by the first level (ix1) with rolling:
df1 = (df.groupby(['ix1','ix2'])
         .agg({'data':'mean'})
         .groupby(level=0, group_keys=False)   # roll within each ix1
         .rolling(2)
         .mean())
print(df1)
data
ix1 ix2
0 2019-01-01 NaN
2019-01-02 0.5
2019-01-03 1.5
2019-01-04 2.5
1 2019-01-01 NaN
2019-01-02 4.5
2019-01-03 5.5
2019-01-04 6.5
In your solution, the aggregation returns a one-column DataFrame, so the chained rolling runs over all rows rather than within each group as needed:
print(df.groupby(['ix1','ix2']).agg({'data':'mean'}))
data
ix1 ix2
0 2019-01-01 0
2019-01-02 1
2019-01-03 2
2019-01-04 3
1 2019-01-01 4
2019-01-02 5
2019-01-03 6
2019-01-04 7
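As a hedged alternative sketch with the same intent (the 'sma' column name from the desired output is used only for illustration), the rolling step can also go through a per-group apply, which keeps the original two-level index without adding an extra group level:
sma = (df.groupby(['ix1', 'ix2'])['data'].mean()        # collapse to one value per (ix1, ix2)
         .groupby(level='ix1', group_keys=False)
         .apply(lambda s: s.rolling(2).mean())          # SMA computed separately per ix1
         .rename('sma')
         .to_frame())
print(sma)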

Reindexing python data frame is creating NaN values

I have a data frame that looks like this, with monthly data points:
Date Value
1 2010-01-01 18.45
2 2010-02-01 18.13
3 2010-03-01 18.25
4 2010-04-01 17.92
5 2010-05-01 18.85
I want to make it daily data and fill in the resulting new dates with the current month value. For example:
Date Value
1 2010-01-01 18.45
2 2010-01-02 18.45
3 2010-01-03 18.45
4 2010-01-04 18.45
5 2010-01-05 18.45
....
This is the code I'm using to add the interim dates and fill the values:
today = get_datetime('US/Eastern') #.strftime('%Y-%m-%d')
enddate='1881-01-01'
idx = pd.date_range(enddate, today.strftime('%Y-%m-%d'), freq='D')
df = df.reindex(idx)
df = df.fillna(method = 'ffill')
The output is as follows:
Date Value
2010-01-01 00:00:00 NaN NaN
2010-01-02 00:00:00 NaN NaN
2010-01-03 00:00:00 NaN NaN
2010-01-04 00:00:00 NaN NaN
2010-01-05 00:00:00 NaN NaN
The logs show that the NaN values appear just before the .fillna method is invoked. So the forward fill is not the culprit.
Any ideas why this is happening?
option 1
safest approach, very general
up-sample to daily, then group monthly with a transform
The reason this matters is that your data point may not fall on the first of the month. If you want to ensure that that day's value gets broadcast to every other day in the month, do this:
(df.set_index('Date').asfreq('D')
   .groupby(pd.Grouper(freq='M'))['Value']   # pd.TimeGrouper('M') in older pandas
   .transform('first')
   .reset_index())
option 2
asfreq
df.set_index('Date').asfreq('D').ffill().reset_index()
option 3
resample
df.set_index('Date').resample('D').first().ffill().reset_index()
For pandas=0.16.1
df.set_index('Date').resample('D').ffill().reset_index()
All produce the same result over this sample data set
You need to set the dates as the index of the original dataframe before calling reindex:
test = pd.DataFrame(np.random.randn(4), index=pd.date_range('2017-01-01', '2017-01-04'), columns=['test'])
test.reindex(pd.date_range('2017-01-01', '2017-01-05'), method='ffill')
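Applied to the asker's own snippet, a minimal hedged sketch of that fix (pd.Timestamp.today() stands in for the asker's get_datetime call) could look like:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
idx = pd.date_range('1881-01-01', pd.Timestamp.today().normalize(), freq='D')

daily = (df.set_index('Date')   # reindex aligns on the index, so the index must hold the dates
           .reindex(idx)
           .ffill())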
