Fill in missing dates of groupby - python

Imagine I have a dataframe that looks like:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
As you can see, this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in the missing dates for each ID. You can see that for ID "1" there is a jump in months between the second and third entry.
I would like a dataframe that looks like:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-03-2006 NA
1 30-04-2006 NA
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
I have no idea how to do this, since I cannot index by date when there are duplicate dates.

One way is to use pivot_table and then unstack:
In [11]: df.pivot_table("VALUE", "DATE", "ID")
Out[11]:
ID 1 2
DATE
28-02-2006 5.0 NaN
30-06-2006 11.0 NaN
31-01-2006 5.0 5.0
31-02-2006 NaN 5.0
31-03-2006 NaN 5.0
31-04-2006 NaN 5.0
31-05-2006 10.0 NaN
In [12]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index()
Out[12]:
ID DATE 0
0 1 28-02-2006 5.0
1 1 30-06-2006 11.0
2 1 31-01-2006 5.0
3 1 31-02-2006 NaN
4 1 31-03-2006 NaN
5 1 31-04-2006 NaN
6 1 31-05-2006 10.0
7 2 28-02-2006 NaN
8 2 30-06-2006 NaN
9 2 31-01-2006 5.0
10 2 31-02-2006 5.0
11 2 31-03-2006 5.0
12 2 31-04-2006 5.0
13 2 31-05-2006 NaN
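Note that after unstack the value column comes out named 0. If you want the original name back, one small cleanup (using the name argument of Series.reset_index, since unstack here returns a Series) is:
In [13]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index(name="VALUE")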
An alternative, and perhaps slightly more efficient, way is to reindex with a MultiIndex built by from_product:
In [21]: df1 = df.set_index(['ID', 'DATE'])
In [22]: df1.reindex(pd.MultiIndex.from_product(df1.index.levels))
Out[22]:
VALUE
1 28-02-2006 5.0
30-06-2006 11.0
31-01-2006 5.0
31-02-2006 NaN
31-03-2006 NaN
31-04-2006 NaN
31-05-2006 10.0
2 28-02-2006 NaN
30-06-2006 NaN
31-01-2006 5.0
31-02-2006 5.0
31-03-2006 5.0
31-04-2006 5.0
31-05-2006 NaN
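For a self-contained version, here is a minimal sketch that reproduces the question's frame (note that from_product drops the level names unless you pass names= explicitly):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'DATE': ['31-01-2006', '28-02-2006', '31-05-2006', '30-06-2006',
             '31-01-2006', '31-02-2006', '31-03-2006', '31-04-2006'],
    'VALUE': [5, 5, 10, 11, 5, 5, 5, 5],
})
df1 = df.set_index(['ID', 'DATE'])
full = df1.reindex(pd.MultiIndex.from_product(df1.index.levels, names=['ID', 'DATE']))
print(full.reset_index())  # back to long form, with the missing rows as NaN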

Another solution is to convert the incomplete data to a "wide" form (a table; this will create cells for the missing values) and then back to a "tall" form.
df.set_index(['ID','DATE']).unstack().stack(dropna=False).reset_index()
# ID DATE VALUE
#0 1 28-02-2006 5.0
#1 1 30-06-2006 11.0
#2 1 31-01-2006 5.0
#3 1 31-02-2006 NaN
#4 1 31-03-2006 NaN
#5 1 31-04-2006 NaN
#6 1 31-05-2006 10.0
#7 2 28-02-2006 NaN
#....
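One caveat for all three approaches: DATE here is a plain string, so the rows sort lexicographically (28-02-2006 lands before 31-01-2006, as the outputs above show). A hedged sketch parsing real datetimes first; errors='coerce' is needed because the sample contains impossible dates such as 31-02-2006, which become NaT:
df['DATE'] = pd.to_datetime(df['DATE'], format='%d-%m-%Y', errors='coerce')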

Related

Sales data: plot number of each item ordered over time

I have a dataframe with time-series data and want to plot the number of each item ordered over time.
date item ordered
1 01-05-2020 1 1
2 01-05-2020 1 23
3 03-06-2020 2 4
4 03-07-2020 2 5
5 04-09-2020 3 4
df_new = df.groupby(['date','item'])['ordered'].sum().reset_index()
df_new.plot()
Use DataFrame.pivot_table before plotting, and don't convert the DatetimeIndex to a column with reset_index before plotting:
df_new = df.pivot_table(index='date', columns='item', values='ordered', aggfunc='sum')
print (df_new)
item 1 2 3
date
01-05-2020 24.0 NaN NaN
03-06-2020 NaN 4.0 NaN
03-07-2020 NaN 5.0 NaN
04-09-2020 NaN NaN 4.0
df_new.plot()
Your solution also works if you unstack instead of calling reset_index:
df_new = df.groupby(['date','item'])['ordered'].sum().unstack()
print (df_new)
item 1 2 3
date
01-05-2020 24.0 NaN NaN
03-06-2020 NaN 4.0 NaN
03-07-2020 NaN 5.0 NaN
04-09-2020 NaN NaN 4.0
df_new.plot()
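For completeness, a minimal end-to-end sketch of the pivot_table approach (assuming the date strings are day-first and should be parsed before plotting):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'date': ['01-05-2020', '01-05-2020', '03-06-2020', '03-07-2020', '04-09-2020'],
    'item': [1, 1, 2, 2, 3],
    'ordered': [1, 23, 4, 5, 4],
})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # assumption: day-first dates
df.pivot_table(index='date', columns='item', values='ordered', aggfunc='sum').plot()
plt.show()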

Maintaining dataframe shape when slicing in pandas

I've imported a .csv into pandas and want to extract specific values and put them into a new column whilst maintaining the existing shape.
So df[::3] extracts the data:
1 1
2 4
3 7
4
5
6
7
I want it to look like
1 1
2
3
4 4
5
6
7 7
Here is a solution:
df = pd.read_csv(r"C:/users/k_sego/colsplit.csv",sep=";")
df1 = df[['col1']]
df2 = df[['col2']]
DF = pd.merge(df1,df2, how='outer',left_on=['col1'],right_on=['col2'])
and the result is
col1 col2
0 1.0 1.0
1 2.0 NaN
2 3.0 NaN
3 4.0 4.0
4 5.0 NaN
5 6.0 NaN
6 7.0 7.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
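Note the outer merge can pair rows unexpectedly if values repeat. If the goal is simply to keep every third value at its original position, a hedged alternative (sketched on a hypothetical single column, since the .csv isn't shown) is to slice and then reindex back to the full index:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7], index=range(1, 8))  # hypothetical column
new_col = s[::3].reindex(s.index)  # keeps rows 1, 4 and 7; the rest become NaN
print(new_col)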

How to interpolate in Pandas using only previous values?

This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
group by ID, then interpolate using only the values above each row, never the values below.
This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group, hence index 6 returns 0, not 2.)
You can group by id and then loop over the groups to interpolate each one. Note that for id = 2 the interpolation will not give you the value 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = (pd.concat(data)).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
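If preferred, the same loop can be written as a single groupby-apply; this is a sketch that should match the loop's behavior (group_keys=False keeps the original index):
df_out = df.groupby('id', group_keys=False).apply(
    lambda g: g.interpolate(method='krogh', limit_direction='forward')
).round(1)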
pandas.Series.interpolate does not currently support what you want, so to achieve your goal you need two groupby operations that account for your desire to use only previous rows. The idea is as follows: combine each missing value (!!!) into one group with the previous rows (this might have limitations if you have several missing values in a row, but it serves well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You can see that, when grouped by ["ID", "extrapolate"], each missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to do the extrapolation (with a spline of order 1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.

split pandas column with tuple

I have a dictionary of the form:
data = {'A': [(1,2),(3,4),(5,6),(7,8),(8,9)],
        'B': [(3,4),(4,5),(5,6),(6,7)],
        'C': [(10,11),(12,13)]}
I create a dataFrame by:
df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in data.items()]))
which in turn becomes:
A B C
(1,2) (3,4) (10,11)
(3,4) (4,5) (12,13)
(5,6) (5,6) NaN
(7,8) (6,7) NaN
(8,9) NaN NaN
Is there a way to go from the dataframe above to the one below:
A B C
one two one two one two
1 2 3 4 10 11
3 4 4 5 12 13
5 6 5 6 NaN NaN
7 8 6 7 NaN NaN
8 9 NaN NaN NaN NaN
You can use a list comprehension with the DataFrame constructor, converting each column to lists via values + tolist (dropping the NaN padding first, so the constructor only sees tuples), then concat:
cols = ['A','B','C']
L = [pd.DataFrame(df[x].dropna().values.tolist(), columns=['one','two']) for x in cols]
df = pd.concat(L, axis=1, keys=cols)
print (df)
A B C
one two one two one two
0 1 2 3.0 4.0 10.0 11.0
1 3 4 4.0 5.0 12.0 13.0
2 5 6 5.0 6.0 NaN NaN
3 7 8 6.0 7.0 NaN NaN
4 8 9 NaN NaN NaN NaN
EDIT:
A similar solution with a dict comprehension; integer values are converted to floats because the type of NaN is float too.
data = {'A':[(1,2),(3,4),(5,6),(7,8),(8,9)],
'B':[(3,4),(4,5),(5,6),(6,7)],
'C':[(10,11),(12,13)]}
cols = ['A','B','C']
d = {k: pd.DataFrame(v, columns=['one','two']) for k,v in data.items()}
df = pd.concat(d, axis=1)
print (df)
A B C
one two one two one two
0 1 2 3.0 4.0 10.0 11.0
1 3 4 4.0 5.0 12.0 13.0
2 5 6 5.0 6.0 NaN NaN
3 7 8 6.0 7.0 NaN NaN
4 8 9 NaN NaN NaN NaN
EDIT:
To multiply by one of the columns, it is possible to use slicers:
s = df[('A', 'one')]
print (s)
0 1
1 3
2 5
3 7
4 8
Name: (A, one), dtype: int64
df.loc(axis=1)[:, 'one'] = df.loc(axis=1)[:, 'one'].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN
Another solution:
idx = pd.IndexSlice
df.loc[:, idx[:, 'one']] = df.loc[:, idx[:, 'one']].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN

pandas backfill NaN by incrementing the last value

I have a data frame:
A B C
Timestamp
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN 5
4 NaN NaN 4
5 NaN 3 3
6 NaN 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
I would like to backfill it by incrementing the last available value in each column so it looks like this:
A B C
Timestamp
1 9 7 7
2 8 6 6
3 7 5 5
4 6 4 4
5 5 3 3
6 4 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
Let's try this: reverse the frame, forward-fill, then add 1 for every consecutive repeat the fill created (tracked with shift + cumsum):
df1 = df1[::-1].fillna(method='ffill')
(df1 + (df1 == df1.shift()).cumsum()).sort_index()
Output:
A B C
Timestamp
1 9.0 7.0 7.0
2 8.0 6.0 6.0
3 7.0 5.0 5.0
4 6.0 4.0 4.0
5 5.0 3.0 3.0
6 4.0 2.0 NaN
7 3.0 1.0 NaN
8 2.0 NaN NaN
9 1.0 NaN NaN
You can try this:
def bfill_increment(col):
    # work bottom-up: flag NaNs in reverse order
    col_null = col.isnull()[::-1]
    # give each contiguous NaN run its own group id
    groups = col_null.diff().fillna(0).cumsum()
    # count NaNs within each run, then add the counts to the backfilled base values
    return col_null.groupby(groups).cumsum()[::-1] + col.bfill()

df.apply(bfill_increment)
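A minimal sketch exercising the helper on the frame from the question (values laid out as shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan] * 6 + [3, 2, 1],
    'B': [np.nan] * 4 + [3, 2, 1, np.nan, np.nan],
    'C': [np.nan, np.nan, 5, 4, 3] + [np.nan] * 4,
}, index=pd.Index(range(1, 10), name='Timestamp'))
print(df.apply(bfill_increment))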
