Fill in missing dates of groupby - python

Imagine I have a dataframe that looks like:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
As you can see, this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in the missing dates for each ID. You can see that for ID "1" there is a jump in months between the second and third entry.
I would like a dataframe that looks like:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-03-2006 NA
1 30-04-2006 NA
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
I have no idea how to do this, since I cannot index by date when there are duplicate dates.

One way is to use pivot_table and then unstack:
In [11]: df.pivot_table("VALUE", "DATE", "ID")
Out[11]:
ID 1 2
DATE
28-02-2006 5.0 NaN
30-06-2006 11.0 NaN
31-01-2006 5.0 5.0
31-02-2006 NaN 5.0
31-03-2006 NaN 5.0
31-04-2006 NaN 5.0
31-05-2006 10.0 NaN
In [12]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index()
Out[12]:
ID DATE 0
0 1 28-02-2006 5.0
1 1 30-06-2006 11.0
2 1 31-01-2006 5.0
3 1 31-02-2006 NaN
4 1 31-03-2006 NaN
5 1 31-04-2006 NaN
6 1 31-05-2006 10.0
7 2 28-02-2006 NaN
8 2 30-06-2006 NaN
9 2 31-01-2006 5.0
10 2 31-02-2006 5.0
11 2 31-03-2006 5.0
12 2 31-04-2006 5.0
13 2 31-05-2006 NaN
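Note that after unstack the value column comes out named 0. If you want the original name back, one small cleanup (using the name argument of Series.reset_index, since unstack here returns a Series) is:
In [13]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index(name="VALUE")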
An alternative, and perhaps slightly more efficient, way is to reindex with a MultiIndex built by from_product:
In [21]: df1 = df.set_index(['ID', 'DATE'])
In [22]: df1.reindex(pd.MultiIndex.from_product(df1.index.levels))
Out[22]:
VALUE
1 28-02-2006 5.0
30-06-2006 11.0
31-01-2006 5.0
31-02-2006 NaN
31-03-2006 NaN
31-04-2006 NaN
31-05-2006 10.0
2 28-02-2006 NaN
30-06-2006 NaN
31-01-2006 5.0
31-02-2006 5.0
31-03-2006 5.0
31-04-2006 5.0
31-05-2006 NaN
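For a self-contained version, here is a minimal sketch that reproduces the question's frame (note that from_product drops the level names unless you pass names= explicitly):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 1, 2, 2, 2, 2],
    'DATE': ['31-01-2006', '28-02-2006', '31-05-2006', '30-06-2006',
             '31-01-2006', '31-02-2006', '31-03-2006', '31-04-2006'],
    'VALUE': [5, 5, 10, 11, 5, 5, 5, 5],
})
df1 = df.set_index(['ID', 'DATE'])
full = df1.reindex(pd.MultiIndex.from_product(df1.index.levels, names=['ID', 'DATE']))
print(full.reset_index())  # back to long form, with the missing rows as NaN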

Another solution is to convert the incomplete data to a "wide" form (a table; this will create cells for the missing values) and then back to a "tall" form.
df.set_index(['ID','DATE']).unstack().stack(dropna=False).reset_index()
# ID DATE VALUE
#0 1 28-02-2006 5.0
#1 1 30-06-2006 11.0
#2 1 31-01-2006 5.0
#3 1 31-02-2006 NaN
#4 1 31-03-2006 NaN
#5 1 31-04-2006 NaN
#6 1 31-05-2006 10.0
#7 2 28-02-2006 NaN
#....
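One caveat for all three approaches: DATE here is a plain string, so the rows sort lexicographically (28-02-2006 lands before 31-01-2006, as the outputs above show). A hedged sketch parsing real datetimes first; errors='coerce' is needed because the sample contains impossible dates such as 31-02-2006, which become NaT:
df['DATE'] = pd.to_datetime(df['DATE'], format='%d-%m-%Y', errors='coerce')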

Related

Sales data: plot number of each item ordered over time

I have a dataframe with time-series data and want to plot the number of each item ordered over time.
date item ordered
1 01-05-2020 1 1
2 01-05-2020 1 23
3 03-06-2020 2 4
4 03-07-2020 2 5
5 04-09-2020 3 4
df_new = df.groupby(['date','item'])['ordered'].sum().reset_index()
df_new.plot()
Use DataFrame.pivot_table before plotting, and don't convert the DatetimeIndex to a column with reset_index before plotting:
df_new = df.pivot_table(index='date', columns='item', values='ordered', aggfunc='sum')
print (df_new)
item 1 2 3
date
01-05-2020 24.0 NaN NaN
03-06-2020 NaN 4.0 NaN
03-07-2020 NaN 5.0 NaN
04-09-2020 NaN NaN 4.0
df_new.plot()
Your solution also works if you unstack instead of calling reset_index:
df_new = df.groupby(['date','item'])['ordered'].sum().unstack()
print (df_new)
item 1 2 3
date
01-05-2020 24.0 NaN NaN
03-06-2020 NaN 4.0 NaN
03-07-2020 NaN 5.0 NaN
04-09-2020 NaN NaN 4.0
df_new.plot()
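For completeness, a minimal end-to-end sketch of the pivot_table approach (assuming the date strings are day-first and should be parsed before plotting):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'date': ['01-05-2020', '01-05-2020', '03-06-2020', '03-07-2020', '04-09-2020'],
    'item': [1, 1, 2, 2, 3],
    'ordered': [1, 23, 4, 5, 4],
})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # assumption: day-first dates
df.pivot_table(index='date', columns='item', values='ordered', aggfunc='sum').plot()
plt.show()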

Maintaining dataframe shape when slicing in pandas

I've imported a .csv into pandas and want to extract specific values and put them into a new column whilst maintaining the existing shape.
So df[::3] extracts the data:
1 1
2 4
3 7
4
5
6
7
I want it to look like
1 1
2
3
4 4
5
6
7 7
Here is a solution:
df = pd.read_csv(r"C:/users/k_sego/colsplit.csv",sep=";")
df1 = df[['col1']]
df2 = df[['col2']]
DF = pd.merge(df1,df2, how='outer',left_on=['col1'],right_on=['col2'])
and the result is
col1 col2
0 1.0 1.0
1 2.0 NaN
2 3.0 NaN
3 4.0 4.0
4 5.0 NaN
5 6.0 NaN
6 7.0 7.0
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
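Note the outer merge can pair rows unexpectedly if values repeat. If the goal is simply to keep every third value at its original position, a hedged alternative (sketched on a hypothetical single column, since the .csv isn't shown) is to slice and then reindex back to the full index:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7], index=range(1, 8))  # hypothetical column
new_col = s[::3].reindex(s.index)  # keeps rows 1, 4 and 7; the rest become NaN
print(new_col)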

How to interpolate in Pandas using only previous values?

This is my dataframe:
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
id value
0 1 5
1 1 6
2 1 NaN
3 2 NaN
4 2 8
5 2 4
6 2 NaN
7 2 10
8 3 NaN
This is my expected output:
id value
0 1 5
1 1 6
2 1 7
3 2 NaN
4 2 8
5 2 4
6 2 2
7 2 10
8 3 NaN
This is my current output using this code:
df.value.interpolate(method="krogh")
0 5.000000
1 6.000000
2 9.071429
3 10.171429
4 8.000000
5 4.000000
6 2.357143
7 10.000000
8 36.600000
Basically, I want to do two important things here:
group by ID, then interpolate using only the values above each row, never the values below.
This should do the trick:
df["value_interp"]=df.value.combine_first(df.groupby("id")["value"].apply(lambda y: y.expanding().apply(lambda x: x.interpolate(method="krogh").to_numpy()[-1], raw=False)))
Outputs:
id value value_interp
0 1.0 5.0 5.0
1 1.0 6.0 6.0
2 1.0 NaN 7.0
3 2.0 NaN NaN
4 2.0 8.0 8.0
5 2.0 4.0 4.0
6 2.0 NaN 0.0
7 2.0 10.0 10.0
8 3.0 NaN NaN
(It interpolates based only on the previous values within the group, hence index 6 returns 0, not 2.)
You can group by id and then loop over the groups to interpolate each one. Note that for id = 2 the interpolation will not give you the value 2:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([ [1,5],[1,6],[1,np.nan],[2,np.nan],[2,8],[2,4],[2,np.nan],[2,10],[3,np.nan]]),columns=['id','value'])
data = []
for name, group in df.groupby('id'):
    group_interpolation = group.interpolate(method='krogh', limit_direction='forward', axis=0)
    data.append(group_interpolation)
df = (pd.concat(data)).round(1)
Output:
id value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 4.7
7 2.0 10.0
8 3.0 NaN
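If preferred, the same loop can be written as a single groupby-apply; this is a sketch that should match the loop's behavior (group_keys=False keeps the original index):
df_out = df.groupby('id', group_keys=False).apply(
    lambda g: g.interpolate(method='krogh', limit_direction='forward')
).round(1)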
pandas.Series.interpolate does not currently support what you want, so to achieve your goal you need two groupby operations that account for your desire to use only previous rows. The idea is as follows: combine each missing value (!!!) into one group with the previous rows (this might have limitations if you have several missing values in a row, but it serves well for your toy example).
Suppose we have a df:
print(df)
ID Value
0 1 5.0
1 1 6.0
2 1 NaN
3 2 NaN
4 2 8.0
5 2 4.0
6 2 NaN
7 2 10.0
8 3 NaN
Then we will combine any missing values within a group with previous rows:
df["extrapolate"] = df.groupby("ID")["Value"].apply(lambda grp: grp.isnull().cumsum().shift().bfill())
print(df)
ID Value extrapolate
0 1 5.0 0.0
1 1 6.0 0.0
2 1 NaN 0.0
3 2 NaN 1.0
4 2 8.0 1.0
5 2 4.0 1.0
6 2 NaN 1.0
7 2 10.0 2.0
8 3 NaN NaN
You can see that, when grouped by ["ID", "extrapolate"], each missing value falls into the same group as the non-null values of the previous rows.
Now we are ready to do the extrapolation (with a spline of order 1):
df.groupby(["ID","extrapolate"], as_index=False).apply(lambda grp:grp.interpolate(method="spline",order=1)).drop("extrapolate", axis=1)
ID Value
0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 2.0 NaN
4 2.0 8.0
5 2.0 4.0
6 2.0 0.0
7 2.0 10.0
8 NaN NaN
Hope this helps.

split pandas column with tuple

I have a dictionary of the form:
data = {'A': [(1,2),(3,4),(5,6),(7,8),(8,9)],
        'B': [(3,4),(4,5),(5,6),(6,7)],
        'C': [(10,11),(12,13)]}
I create a dataFrame by:
df = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in data.items()]))
which in turn becomes:
A B C
(1,2) (3,4) (10,11)
(3,4) (4,5) (12,13)
(5,6) (5,6) NaN
(7,8) (6,7) NaN
(8,9) NaN NaN
Is there a way to go from the dataframe above to the one below:
A B C
one two one two one two
1 2 3 4 10 11
3 4 4 5 12 13
5 6 5 6 NaN NaN
7 8 6 7 NaN NaN
8 9 NaN NaN NaN NaN
You can use a list comprehension with the DataFrame constructor, converting each column to lists via values + tolist (dropping the NaN padding first, so the constructor only sees tuples), then concat:
cols = ['A','B','C']
L = [pd.DataFrame(df[x].dropna().values.tolist(), columns=['one','two']) for x in cols]
df = pd.concat(L, axis=1, keys=cols)
print (df)
A B C
one two one two one two
0 1 2 3.0 4.0 10.0 11.0
1 3 4 4.0 5.0 12.0 13.0
2 5 6 5.0 6.0 NaN NaN
3 7 8 6.0 7.0 NaN NaN
4 8 9 NaN NaN NaN NaN
EDIT:
A similar solution with a dict comprehension; integer values are converted to floats because the type of NaN is float too.
data = {'A':[(1,2),(3,4),(5,6),(7,8),(8,9)],
'B':[(3,4),(4,5),(5,6),(6,7)],
'C':[(10,11),(12,13)]}
cols = ['A','B','C']
d = {k: pd.DataFrame(v, columns=['one','two']) for k,v in data.items()}
df = pd.concat(d, axis=1)
print (df)
A B C
one two one two one two
0 1 2 3.0 4.0 10.0 11.0
1 3 4 4.0 5.0 12.0 13.0
2 5 6 5.0 6.0 NaN NaN
3 7 8 6.0 7.0 NaN NaN
4 8 9 NaN NaN NaN NaN
EDIT:
To multiply by one of the columns, it is possible to use slicers:
s = df[('A', 'one')]
print (s)
0 1
1 3
2 5
3 7
4 8
Name: (A, one), dtype: int64
df.loc(axis=1)[:, 'one'] = df.loc(axis=1)[:, 'one'].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN
Another solution:
idx = pd.IndexSlice
df.loc[:, idx[:, 'one']] = df.loc[:, idx[:, 'one']].mul(s, axis=0)
print (df)
A B C
one two one two one two
0 1.0 2 3.0 4.0 10.0 11.0
1 9.0 4 12.0 5.0 36.0 13.0
2 25.0 6 25.0 6.0 NaN NaN
3 49.0 8 42.0 7.0 NaN NaN
4 64.0 9 NaN NaN NaN NaN

pandas backfill NaN by incrementing the last value

I have a data frame:
A B C
Timestamp
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN 5
4 NaN NaN 4
5 NaN 3 3
6 NaN 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
I would like to backfill it by incrementing the last available value in each column so it looks like this:
A B C
Timestamp
1 9 7 7
2 8 6 6
3 7 5 5
4 6 4 4
5 5 3 3
6 4 2 NaN
7 3 1 NaN
8 2 NaN NaN
9 1 NaN NaN
Let's try this: reverse the frame, forward-fill, then add 1 for every consecutive repeat the fill created (tracked with shift + cumsum):
df1 = df1[::-1].fillna(method='ffill')
(df1 + (df1 == df1.shift()).cumsum()).sort_index()
Output:
A B C
Timestamp
1 9.0 7.0 7.0
2 8.0 6.0 6.0
3 7.0 5.0 5.0
4 6.0 4.0 4.0
5 5.0 3.0 3.0
6 4.0 2.0 NaN
7 3.0 1.0 NaN
8 2.0 NaN NaN
9 1.0 NaN NaN
You can try this:
def bfill_increment(col):
    # work bottom-up: flag NaNs in reverse order
    col_null = col.isnull()[::-1]
    # give each contiguous NaN run its own group id
    groups = col_null.diff().fillna(0).cumsum()
    # count NaNs within each run, then add the counts to the backfilled base values
    return col_null.groupby(groups).cumsum()[::-1] + col.bfill()

df.apply(bfill_increment)
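A minimal sketch exercising the helper on the frame from the question (values laid out as shown above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan] * 6 + [3, 2, 1],
    'B': [np.nan] * 4 + [3, 2, 1, np.nan, np.nan],
    'C': [np.nan, np.nan, 5, 4, 3] + [np.nan] * 4,
}, index=pd.Index(range(1, 10), name='Timestamp'))
print(df.apply(bfill_increment))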
