This is what I have:
df = pd.DataFrame({'item': [1,1,2,2,1,1],
'shop': ['A','A','A','A','B','B'],
'date': pd.to_datetime(['2018.01.'+ str(x) for x in [2,3,1,4,4,5]]),
'qty': [5,6,7,8,9,10]})
print(df)
item shop date qty
0 1 A 2018-01-02 5
1 1 A 2018-01-03 6
2 2 A 2018-01-01 7
3 2 A 2018-01-04 8
4 1 B 2018-01-04 9
5 1 B 2018-01-05 10
This is what I want:
out = pd.DataFrame({'item': [1,1,1,1,2,2,2,2,2,1,1],
'shop': ['A','A','A','A','A','A','A','A','A','B','B'],
'date': pd.to_datetime(['2018.01.'+ str(x) for x in [2,3,4,5,1,2,3,4,5,4,5]]),
'qty': [5,6,0,0,7,0,0,8,0,9,10]})
print(out)
item shop date qty
0 1 A 2018-01-02 5
1 1 A 2018-01-03 6
2 1 A 2018-01-04 0
3 1 A 2018-01-05 0
4 2 A 2018-01-01 7
5 2 A 2018-01-02 0
6 2 A 2018-01-03 0
7 2 A 2018-01-04 8
8 2 A 2018-01-05 0
9 1 B 2018-01-04 9
10 1 B 2018-01-05 10
This is what I achieved so far:
df.set_index('date').groupby(['item', 'shop']).resample("D")['qty'].sum().reset_index(name='qty')
item shop date qty
0 1 A 2018-01-02 5
1 1 A 2018-01-03 6
2 1 B 2018-01-04 9
3 1 B 2018-01-05 10
4 2 A 2018-01-01 7
5 2 A 2018-01-02 0
6 2 A 2018-01-03 0
7 2 A 2018-01-04 8
I want to complete the missing dates (by day!) so that each group [item-shop] will end with the same date.
Ideas?
The key here is to compute the min date within each group and the global max date, then build the date range, explode it, and merge back.
# find the min date for each shop under each item
s = df.groupby(['item','shop'])[['date']].min()
# find the global max
s['datemax'] = df['date'].max()
# combine two results
s['date'] = [pd.date_range(x,y) for x , y in zip(s['date'],s['datemax'])]
out = s.explode('date').reset_index().merge(df,how='left').fillna(0)
out
item shop date datemax qty
0 1 A 2018-01-02 2018-01-05 5.0
1 1 A 2018-01-03 2018-01-05 6.0
2 1 A 2018-01-04 2018-01-05 0.0
3 1 A 2018-01-05 2018-01-05 0.0
4 1 B 2018-01-04 2018-01-05 9.0
5 1 B 2018-01-05 2018-01-05 10.0
6 2 A 2018-01-01 2018-01-05 7.0
7 2 A 2018-01-02 2018-01-05 0.0
8 2 A 2018-01-03 2018-01-05 0.0
9 2 A 2018-01-04 2018-01-05 8.0
10 2 A 2018-01-05 2018-01-05 0.0
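To line this up exactly with the desired out (a small follow-up, assuming the frame above is kept as out), drop the helper column and restore an integer qty:
out = out.drop(columns='datemax')
out['qty'] = out['qty'].astype(int)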
I think this gives you what you want (columns are ordered differently)
max_date = df.date.max()
def reindex_to_max_date(df):
    return df.set_index('date').reindex(pd.date_range(df.date.min(), max_date, name='date'), fill_value=0)
res = df.groupby(['shop', 'item']).apply(reindex_to_max_date)
res = res.qty.reset_index()
I grouped by shop, item to give the same sort order as in out, but these can be swapped. If the column order matters, see the sketch below.
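A quick follow-up sketch (assuming res from above) puts the columns in the same order as out:
res = res[['item', 'shop', 'date', 'qty']]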
Not sure if this is the most efficient way, but one idea is to create a dataframe with all the dates and do a left join at shop-item level, as follows.
Initial data
import pandas as pd
df = pd.DataFrame({'item': [1,1,2,2,1,1],
'shop': ['A','A','A','A','B','B'],
'date': pd.to_datetime(['2018.01.'+ str(x)
for x in [2,3,1,4,4,5]]),
'qty': [5,6,7,8,9,10]})
df = df.set_index('date')\
       .groupby(['item', 'shop'])\
       .resample("D")['qty']\
       .sum()\
       .reset_index(name='qty')
Dataframe with all dates
We first get the min and max dates:
rg = df.agg({"date": ["min", "max"]})
and then we create a df with all possible dates
df_dates = pd.DataFrame(
    {"date": pd.date_range(start=rg["date"]["min"],
                           end=rg["date"]["max"])})
Complete dates
Now, for every shop-item group, we do a left join with all possible dates:
def complete_dates(x, df_dates):
    item = x["item"].iloc[0]
    shop = x["shop"].iloc[0]
    x = pd.merge(df_dates, x, on=["date"], how="left")
    x["item"] = item
    x["shop"] = shop
    return x
And we finally apply this function to the original df.
df.groupby(["item", "shop"])\
.apply(lambda x:
complete_dates(x, df_dates)
)\
.reset_index(drop=True)
date item shop qty
0 2018-01-01 1 A NaN
1 2018-01-02 1 A 5.0
2 2018-01-03 1 A 6.0
3 2018-01-04 1 A NaN
4 2018-01-05 1 A NaN
5 2018-01-01 1 B NaN
6 2018-01-02 1 B NaN
7 2018-01-03 1 B NaN
8 2018-01-04 1 B 9.0
9 2018-01-05 1 B 10.0
10 2018-01-01 2 A 7.0
11 2018-01-02 2 A 0.0
12 2018-01-03 2 A 0.0
13 2018-01-04 2 A 8.0
14 2018-01-05 2 A NaN
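To match the desired output, the remaining NaN values can then be filled and cast back to integers (a minimal follow-up, assuming the chained result above is assigned to res):
res['qty'] = res['qty'].fillna(0).astype(int)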
You could use the complete function from pyjanitor to explicitly expose the missing values; the end date is the global max of date, while the start date varies per item-shop group.
Create a dictionary that pairs the target column date to a new date range:
new_date = {"date" : lambda date: pd.date_range(date.min(), df['date'].max())}
Pass the new_date variable to complete:
# pip install https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd
df.complete([new_date], by = ['item', 'shop']).fillna(0)
item shop date qty
0 1 A 2018-01-02 5.0
1 1 A 2018-01-03 6.0
2 1 A 2018-01-04 0.0
3 1 A 2018-01-05 0.0
4 1 B 2018-01-04 9.0
5 1 B 2018-01-05 10.0
6 2 A 2018-01-01 7.0
7 2 A 2018-01-02 0.0
8 2 A 2018-01-03 0.0
9 2 A 2018-01-04 8.0
10 2 A 2018-01-05 0.0
complete is just an abstraction of pandas functions that makes it easier to explicitly expose missing values in a Pandas dataframe.
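For reference, a rough plain-pandas sketch of the same per-group logic (just the idea, not pyjanitor's actual implementation):
out = (df.groupby(['item', 'shop'], group_keys=False)
         .apply(lambda g: g.set_index('date')
                           .reindex(pd.date_range(g['date'].min(),
                                                  df['date'].max(),
                                                  name='date'))
                           .assign(item=g['item'].iat[0], shop=g['shop'].iat[0])
                           .reset_index())
         .fillna({'qty': 0}))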
Related
I have a DataFrame like
In [67]: df
Out[67]:
id ts
0 a 2018-01-01
1 a 2018-01-02
2 a 2018-01-03
3 a 2018-01-04
4 a 2018-01-05
5 a 2018-01-06
6 a 2018-01-07
7 a 2018-01-08
8 b 2018-01-03
9 b 2018-01-04
10 b 2018-01-05
11 b 2018-01-06
12 b 2018-01-07
13 b 2018-01-08
14 b 2018-01-09
15 b 2018-01-10
16 b 2018-01-11
How can I extract the part where a and b have the same ts?
id ts
2 a 2018-01-03
3 a 2018-01-04
4 a 2018-01-05
5 a 2018-01-06
6 a 2018-01-07
7 a 2018-01-08
8 b 2018-01-03
9 b 2018-01-04
10 b 2018-01-05
11 b 2018-01-06
12 b 2018-01-07
13 b 2018-01-08
There might be many unique ids besides a and b. I want the intersection of column ts across all of them.
What would be the expected output with an additional row of c 2018-01-04?
It would be
a 2018-01-04
b 2018-01-04
c 2018-01-04
The idea is to reshape with DataFrame.pivot_table, which produces missing values for datetimes not shared by every id; remove those rows with DataFrame.dropna and then filter the original DataFrame with Series.isin:
df1 = df.pivot_table(index='ts', columns='id', aggfunc='size').dropna()
df = df[df['ts'].isin(df1.index)]
print (df)
id ts
2 a 2018-01-03
3 a 2018-01-04
4 a 2018-01-05
5 a 2018-01-06
6 a 2018-01-07
7 a 2018-01-08
8 b 2018-01-03
9 b 2018-01-04
10 b 2018-01-05
11 b 2018-01-06
12 b 2018-01-07
13 b 2018-01-08
Test after adding a new c row:
df1 = df.pivot_table(index='ts', columns='id', aggfunc='size').dropna()
df = df[df['ts'].isin(df1.index)]
print (df)
id ts
3 a 2018-01-04
9 b 2018-01-04
17 c 2018-01-04
To keep only the intersecting values, you could take the groupby.size of ts and check which of these groups have a size equal to the number of unique values in id. Then use the result to index the dataframe.
Checking on the proposed dataframe, plus the additional row c 2018-01-04, this returns only the intersecting dates in ts:
s = df.groupby(df.ts).size().eq(df.id.nunique())
df[df.ts.isin(s[s].index)]
id ts
3 a 2018-01-04
9 b 2018-01-04
16 c 2018-01-04
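The same intersection test can also be written with groupby.filter, keeping only the timestamps at which every id appears (a sketch, equivalent to the mask above):
df.groupby('ts').filter(lambda g: g['id'].nunique() == df['id'].nunique())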
I have a dataframe like below. Each date is Monday of each week.
df = pd.DataFrame({'date' :['2020-04-20', '2020-05-11','2020-05-18',
'2020-04-20', '2020-04-27','2020-05-04','2020-05-18'],
'name': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
'count': [23, 44, 125, 6, 9, 10, 122]})
date name count
0 2020-04-20 A 23
1 2020-05-11 A 44
2 2020-05-18 A 125
3 2020-04-20 B 6
4 2020-04-27 B 9
5 2020-05-04 B 10
6 2020-05-18 B 122
Neither 'A' nor 'B' covers the whole date range. Both of them have some missing dates, meaning the count for that week is 0. Below are all the dates:
df_dates = pd.DataFrame({ 'date':['2020-04-20', '2020-04-27','2020-05-04','2020-05-11','2020-05-18'] })
So what I need is like the dataframe below:
date name count
0 2020-04-20 A 23
1 2020-04-27 A 0
2 2020-05-04 A 0
3 2020-05-11 A 44
4 2020-05-18 A 125
5 2020-04-20 B 6
6 2020-04-27 B 9
7 2020-05-04 B 10
8 2020-05-11 B 0
9 2020-05-18 B 122
It seems like I need to join (merge) df_dates with df for each name group (A and B) and then fill the missing name and count values with 0's. Does anyone know how to achieve that, i.e. how to join another table with a grouped table?
I tried and no luck...
pd.merge(df_dates, df.groupby('name'), how='left', on='date')
We can do reindex after building a MultiIndex from the product of dates and names.
idx=pd.MultiIndex.from_product([df_dates.date,df.name.unique()],names=['date','name'])
s=df.set_index(['date','name']).reindex(idx,fill_value=0).reset_index().sort_values('name')
Out[136]:
date name count
0 2020-04-20 A 23
2 2020-04-27 A 0
4 2020-05-04 A 0
6 2020-05-11 A 44
8 2020-05-18 A 125
1 2020-04-20 B 6
3 2020-04-27 B 9
5 2020-05-04 B 10
7 2020-05-11 B 0
9 2020-05-18 B 122
Or
s=df.pivot(*df.columns).reindex(df_dates.date).fillna(0).reset_index().melt('date')
Out[145]:
date name value
0 2020-04-20 A 23.0
1 2020-04-27 A 0.0
2 2020-05-04 A 0.0
3 2020-05-11 A 44.0
4 2020-05-18 A 125.0
5 2020-04-20 B 6.0
6 2020-04-27 B 9.0
7 2020-05-04 B 10.0
8 2020-05-11 B 0.0
9 2020-05-18 B 122.0
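If you need the original column name and an integer dtype back, a small cleanup sketch (assuming s from above):
s = s.rename(columns={'value': 'count'})
s['count'] = s['count'].astype(int)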
If you are just looking to fill in the union of dates already present in df, you can do:
(df.set_index(['date','name'])
.unstack('date',fill_value=0)
.stack().reset_index()
)
Output:
name date count
0 A 2020-04-20 23
1 A 2020-04-27 0
2 A 2020-05-04 0
3 A 2020-05-11 44
4 A 2020-05-18 125
5 B 2020-04-20 6
6 B 2020-04-27 9
7 B 2020-05-04 10
8 B 2020-05-11 0
9 B 2020-05-18 122
I have a dataframe with timeseries of sales transactions for different items:
import pandas as pd
from datetime import timedelta
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
# Create datetimes and data
df_1['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_1['item'] = 1
df_1['sales']= 2
df_2['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_2['item'] = 2
df_2['sales']= 3
df_3['date'] = pd.date_range('1/1/2018', periods=5, freq='D')
df_3['item'] = 3
df_3['sales']= 4
df = pd.concat([df_1, df_2, df_3])
df = df.sort_values(['item'])
df
Resulting dataframe:
date item sales
0 2018-01-01 1 2
1 2018-01-02 1 2
2 2018-01-03 1 2
3 2018-01-04 1 2
4 2018-01-05 1 2
0 2018-01-01 2 3
1 2018-01-02 2 3
2 2018-01-03 2 3
3 2018-01-04 2 3
4 2018-01-05 2 3
0 2018-01-01 3 4
1 2018-01-02 3 4
2 2018-01-03 3 4
3 2018-01-04 3 4
4 2018-01-05 3 4
I want to compute a sum of "sales" for a given item in a given time window. I can't use pandas rolling.sum because the timeseries is sparse (e.g. 2018-01-01 → 2018-01-04 → 2018-01-06 → etc.).
I've tried this solution (for time window = 2 days):
df['start_date'] = df['date'] - timedelta(3)
df['end_date'] = df['date'] - timedelta(1)
df['rolled_sales'] = df.apply(lambda x: df.loc[(df.date >= x.start_date) &
                                               (df.date <= x.end_date), 'sales'].sum(), axis=1)
but it results in sums of sales across all items for a given time window:
date item sales start_date end_date rolled_sales
0 2018-01-01 1 2 2017-12-29 2017-12-31 0
1 2018-01-02 1 2 2017-12-30 2018-01-01 9
2 2018-01-03 1 2 2017-12-31 2018-01-02 18
3 2018-01-04 1 2 2018-01-01 2018-01-03 27
4 2018-01-05 1 2 2018-01-02 2018-01-04 27
0 2018-01-01 2 3 2017-12-29 2017-12-31 0
1 2018-01-02 2 3 2017-12-30 2018-01-01 9
2 2018-01-03 2 3 2017-12-31 2018-01-02 18
3 2018-01-04 2 3 2018-01-01 2018-01-03 27
4 2018-01-05 2 3 2018-01-02 2018-01-04 27
0 2018-01-01 3 4 2017-12-29 2017-12-31 0
1 2018-01-02 3 4 2017-12-30 2018-01-01 9
2 2018-01-03 3 4 2017-12-31 2018-01-02 18
3 2018-01-04 3 4 2018-01-01 2018-01-03 27
4 2018-01-05 3 4 2018-01-02 2018-01-04 27
My goal is to have rolled_sales computed for each item separately, like this:
date item sales start_date end_date rolled_sales
0 2018-01-01 1 2 2017-12-29 2017-12-31 0
1 2018-01-02 1 2 2017-12-30 2018-01-01 2
2 2018-01-03 1 2 2017-12-31 2018-01-02 4
3 2018-01-04 1 2 2018-01-01 2018-01-03 6
4 2018-01-05 1 2 2018-01-02 2018-01-04 8
0 2018-01-01 2 3 2017-12-29 2017-12-31 0
1 2018-01-02 2 3 2017-12-30 2018-01-01 3
2 2018-01-03 2 3 2017-12-31 2018-01-02 6
3 2018-01-04 2 3 2018-01-01 2018-01-03 9
4 2018-01-05 2 3 2018-01-02 2018-01-04 12
0 2018-01-01 3 4 2017-12-29 2017-12-31 0
1 2018-01-02 3 4 2017-12-30 2018-01-01 4
2 2018-01-03 3 4 2017-12-31 2018-01-02 8
3 2018-01-04 3 4 2018-01-01 2018-01-03 12
4 2018-01-05 3 4 2018-01-02 2018-01-04 16
I tried to apply the solution suggested here: Pandas rolling sum for multiply values separately, but failed.
Any ideas?
Many Thanks in advance :)
Andy
Total sales with a 2-day rolling window per item:
z = df.sort_values('date').set_index('date').groupby('item').rolling('2d')['sales'].sum()
Output:
item date
1 2018-01-01 2.0
2018-01-02 4.0
2018-01-03 4.0
2018-01-04 4.0
2018-01-05 4.0
2 2018-01-01 3.0
2018-01-02 6.0
2018-01-03 6.0
2018-01-04 6.0
2018-01-05 6.0
3 2018-01-01 4.0
2018-01-02 8.0
2018-01-03 8.0
2018-01-04 8.0
2018-01-05 8.0
Name: sales, dtype: float64
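If you want this back on the original rows, the MultiIndexed result can be flattened and merged (a follow-up sketch, assuming z from above):
z = z.reset_index(name='rolled_sales')
out = df.merge(z, on=['item', 'date'], how='left')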
Total sales from the last 2 days per item:
df[df.groupby('item').cumcount(ascending=False) < 2].groupby('item')['sales'].sum()
Total sales between start_date and end_date per item:
start_date = pd.to_datetime('2017-12-2')
end_date = pd.to_datetime('2018-12-2')
df[df['date'].between(start_date, end_date)].groupby('item')['sales'].sum()
df['rolled_sum'] = (df.groupby('item')
                      .rolling('3D', on='date').sum()['sales']
                      .to_numpy())
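One caveat: the .to_numpy() assignment only lines up with the original rows if df is already sorted by item (and by date within each item), in the same order the groupby produces; sorting first is a cheap safeguard:
df = df.sort_values(['item', 'date']).reset_index(drop=True)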
After some data wrangling (I removed some rows to simulate sparse dates, and added helper columns "start_date" and "end_date" for a 3-day distance from a given date), the final output looks like this:
date item sales start_date end_date rolled_sum
0 2018-01-01 1 2 2017-12-30 2018-01-01 2.0
3 2018-01-04 1 2 2018-01-02 2018-01-04 2.0
4 2018-01-05 1 2 2018-01-03 2018-01-05 4.0
7 2018-01-08 1 2 2018-01-06 2018-01-08 2.0
9 2018-01-10 1 2 2018-01-08 2018-01-10 4.0
12 2018-01-03 2 3 2018-01-01 2018-01-03 3.0
13 2018-01-04 2 3 2018-01-02 2018-01-04 6.0
15 2018-01-06 2 3 2018-01-04 2018-01-06 6.0
17 2018-01-08 2 3 2018-01-06 2018-01-08 6.0
18 2018-01-09 2 3 2018-01-07 2018-01-09 6.0
19 2018-01-10 2 3 2018-01-08 2018-01-10 9.0
21 2018-01-02 3 4 2017-12-31 2018-01-02 4.0
23 2018-01-04 3 4 2018-01-02 2018-01-04 8.0
25 2018-01-06 3 4 2018-01-04 2018-01-06 8.0
26 2018-01-07 3 4 2018-01-05 2018-01-07 8.0
27 2018-01-08 3 4 2018-01-06 2018-01-08 12.0
28 2018-01-09 3 4 2018-01-07 2018-01-09 12.0
29 2018-01-10 3 4 2018-01-08 2018-01-10 12.0
The magic was in the rolling window parameter: instead of "3", I should use "3D".
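In case it helps anyone else, a tiny illustration of the difference on a sparse series (values made up just for this sketch):
s = pd.Series([2, 2, 2],
              index=pd.to_datetime(['2018-01-01', '2018-01-04', '2018-01-05']))
s.rolling(3).sum()     # window of 3 observations, ignores the gaps in dates
s.rolling('3D').sum()  # window of the last 3 calendar days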
Many Thanks for Your help :)
Andy
My Dataframe df3 looks something like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50
...
11 11 2018-01-01 00:00:07.523 125.5 120
12 12 2018-01-01 00:00:08.757 125.0 120
13 13 2018-01-04 00:00:14.507 127.0 300
14 14 2018-01-04 00:00:15.743 126.5 300
15 15 2018-01-05 00:00:19.407 125.5 350
I wanted to resample using ffill for every second so that it looks like this:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:06.000 125.00 101
1 2 2018-01-01 00:00:07.000 125.00 101
2 3 2018-01-01 00:00:08.000 125.00 101
3 4 2018-01-02 00:00:09.000 125.00 52
4 5 2018-01-02 00:00:10.000 127.00 52
...
My code:
def resample(df):
    indexing = df[['Timestamp', 'Data']]
    indexing['Timestamp'] = pd.to_datetime(indexing['Timestamp'])
    indexing = indexing.set_index('Timestamp')
    indexing1 = indexing.resample('1S', fill_method='ffill')
    # indexing1 = indexing1.resample('D')
    return indexing1

indexing = resample(df3)
but it raised this error:
ValueError: cannot reindex a non-unique index with a method or limit
I don't quite understand what this error means. @jezrael, in this similar question, suggested using drop_duplicates with groupby. I am not sure what this does to the data, as it seems there are no duplicates in my data. Can someone explain this please? Thanks.
This error is caused by rows like the following:
Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
When you resample both of these timestamps to the nearest second, they both become 2018-01-01 00:00:06, and pandas doesn't know which of the two data values to pick. Instead, you can use an aggregation function such as last (though mean, max, or min may also be suitable) to select one of the values, and then apply the forward fill.
Example:
from io import StringIO
import pandas as pd
df = pd.read_table(StringIO(""" Id Timestamp Data Group_Id
0 1 2018-01-01 00:00:05.523 125.5 101
1 2 2018-01-01 00:00:05.757 125.0 101
2 3 2018-01-02 00:00:09.507 127.0 52
3 4 2018-01-02 00:00:13.743 126.5 52
4 5 2018-01-03 00:00:15.407 125.5 50"""), sep=r'\s\s+')
df['Timestamp'] = pd.to_datetime(df['Timestamp']).dt.round('s')
df.set_index('Timestamp', inplace=True)
df = df.resample('1S').last().ffill()
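If the forward fill must not bleed across different Group_Id values, the same idea can be applied per group (a sketch along the same lines, assuming the rounded, Timestamp-indexed df from above):
df = (df.groupby('Group_Id')
        .apply(lambda g: g.resample('1S').last().ffill()))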
In Python and Pandas, I have one dataframe for 2018 which looks like this:
Date Stock_id Stock_value
02/01/2018 1 4
03/01/2018 1 2
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
and a dataframe with one column which has all the 2018 dates like the following:
Date
01/01/2018
02/01/2018
03/01/2018
04/01/2018
05/01/2018
06/01/2018
etc
I want to merge these to get my first dataframe with full dates for 2018 for each stock, with NAs wherever there was no data.
Basically, I want to have, for each stock, a row for each date of 2018 (where rows which do not have any data should be filled in with NAs).
Thus, I want to have the following as an output for the sample above:
Date Stock_id Stock_value
01/01/2018 1 NA
02/01/2018 1 4
03/01/2018 1 2
04/01/2018 1 NA
05/01/2018 1 7
01/01/2018 2 6
02/01/2018 2 9
03/01/2018 2 4
04/01/2018 2 6
05/01/2018 2 NA
How can I do this?
I tested
data = data_1.merge(data_2, on='Date' , how='outer')
and
data = data_1.merge(data_2, on='Date' , how='right')
but I still got the original dataframe with no new dates added, only some rows filled with NAs everywhere.
Use itertools.product for all combinations of Date and Stock_id values, then merge with a left join:
df1['Date'] = pd.to_datetime(df1['Date'], dayfirst=True)
df2['Date'] = pd.to_datetime(df2['Date'], dayfirst=True)
from itertools import product
c = ['Stock_id','Date']
df = pd.DataFrame(list(product(df1['Stock_id'].unique(), df2['Date'])), columns=c)
print (df)
Stock_id Date
0 1 2018-01-01
1 1 2018-01-02
2 1 2018-01-03
3 1 2018-01-04
4 1 2018-01-05
5 1 2018-01-06
6 2 2018-01-01
7 2 2018-01-02
8 2 2018-01-03
9 2 2018-01-04
10 2 2018-01-05
11 2 2018-01-06
and
df = df[['Date','Stock_id']].merge(df1, how='left')
#if necessary specify both columns
#df = df[['Date','Stock_id']].merge(df1, how='left', on=['Date','Stock_id'])
print (df)
Date Stock_id Stock_value
0 2018-01-01 1 NaN
1 2018-01-02 1 4.0
2 2018-01-03 1 2.0
3 2018-01-04 1 NaN
4 2018-01-05 1 7.0
5 2018-01-06 1 NaN
6 2018-01-01 2 6.0
7 2018-01-02 2 9.0
8 2018-01-03 2 4.0
9 2018-01-04 2 6.0
10 2018-01-05 2 NaN
11 2018-01-06 2 NaN
Another idea, though it should be slow on large data:
df = (df1.groupby('Stock_id')[['Date','Stock_value']]
.apply(lambda x: x.set_index('Date').reindex(df2['Date']))
.reset_index())
print (df)
Stock_id Date Stock_value
0 1 2018-01-01 NaN
1 1 2018-01-02 4.0
2 1 2018-01-03 2.0
3 1 2018-01-04 NaN
4 1 2018-01-05 7.0
5 1 2018-01-06 NaN
6 2 2018-01-01 6.0
7 2 2018-01-02 9.0
8 2 2018-01-03 4.0
9 2 2018-01-04 6.0
10 2 2018-01-05 NaN
11 2 2018-01-06 NaN