Create Multi-Index empty DataFrame to join with main DataFrame [duplicate] - python

This question already has answers here:
Pandas filling missing dates and values within group (3 answers)
Closed 3 months ago.
Suppose I have a DataFrame that can be created with the code below:
import pandas as pd

df = pd.DataFrame(data={'date': ['2021-01-01', '2021-01-02', '2021-01-05',
                                 '2021-01-02', '2021-01-03', '2021-01-05'],
                        'product': ['A', 'A', 'A', 'B', 'B', 'B'],
                        'price': [10, 20, 30, 40, 50, 60]})
df['date'] = pd.to_datetime(df['date'])
I want to create an empty DataFrame, say main_df, which contains all dates between df.date.min() and df.date.max() for each product. On days where the value is NaN I want to ffill, then bfill whatever remains. The resulting DataFrame would be as below:
+------------+---------+-------+
| date | product | price |
+------------+---------+-------+
| 2021-01-01 | A | 10 |
| 2021-01-02 | A | 20 |
| 2021-01-03 | A | 20 |
| 2021-01-04 | A | 20 |
| 2021-01-05 | A | 30 |
| 2021-01-01 | B | 40 |
| 2021-01-02 | B | 40 |
| 2021-01-03 | B | 50 |
| 2021-01-04 | B | 50 |
| 2021-01-05 | B | 60 |
+------------+---------+-------+
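For reference, the "empty" multi-index scaffold the title asks about can be built directly with MultiIndex.from_product and then filled per product. This is a minimal sketch against the frame above, not taken from the answers below:
full_idx = pd.MultiIndex.from_product(
    [pd.date_range(df['date'].min(), df['date'].max(), freq='D'),
     df['product'].unique()],
    names=['date', 'product'])
main_df = df.set_index(['date', 'product']).reindex(full_idx)  # the empty scaffold, joined
main_df['price'] = main_df.groupby(level='product')['price'].ffill()
main_df['price'] = main_df.groupby(level='product')['price'].bfill()
main_df = main_df.reset_index().sort_values(['product', 'date'], ignore_index=True)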

First
Make a pivot table, upsample to daily frequency with asfreq('D'), and fill the nulls:
df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
output:
product A B
date
2021-01-01 10.0 40.0
2021-01-02 20.0 40.0
2021-01-03 20.0 50.0
2021-01-04 20.0 50.0
2021-01-05 30.0 60.0
Second
Stack the result and tidy up (full code included):
(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
   .stack().reset_index().rename(columns={0: 'price'})
   .sort_values('product').reset_index(drop=True))
output:
date product price
0 2021-01-01 A 10.0
1 2021-01-02 A 20.0
2 2021-01-03 A 20.0
3 2021-01-04 A 20.0
4 2021-01-05 A 30.0
5 2021-01-01 B 40.0
6 2021-01-02 B 40.0
7 2021-01-03 B 50.0
8 2021-01-04 B 50.0
9 2021-01-05 B 60.0
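One caveat to note (an addition, not from the original answer): pivot_table aggregates duplicate (date, product) pairs with the mean by default, so if duplicates must be handled differently, pass an explicit aggfunc:
df.pivot_table('price', 'date', 'product', aggfunc='first').asfreq('D').ffill().bfill()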

Using resample
df = pd.DataFrame(data={'date': ['2021-01-01', '2021-01-02', '2021-01-05',
                                 '2021-01-02', '2021-01-03', '2021-01-06'],
                        'product': ['A', 'A', 'A', 'B', 'B', 'B'],
                        'price': [10, 20, 30, 40, 50, 60]})
df['date'] = pd.to_datetime(df['date'])
df
# Out:
# date product price
# 0 2021-01-01 A 10
# 1 2021-01-02 A 20
# 2 2021-01-05 A 30
# 3 2021-01-02 B 40
# 4 2021-01-03 B 50
# 5 2021-01-06 B 60
df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out:
# product date price
# 0 A 2021-01-01 10
# 1 A 2021-01-02 20
# 2 A 2021-01-03 20
# 3 A 2021-01-04 20
# 4 A 2021-01-05 30
# 5 B 2021-01-02 40
# 6 B 2021-01-03 50
# 7 B 2021-01-04 50
# 8 B 2021-01-05 50
# 9 B 2021-01-06 60
To see which rows were filled by ffill, resample without filling:
df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out:
# product date
# A 2021-01-01 10.0
# 2021-01-02 20.0
# 2021-01-03 NaN
# 2021-01-04 NaN
# 2021-01-05 30.0
# B 2021-01-02 40.0
# 2021-01-03 50.0
# 2021-01-04 NaN
# 2021-01-05 NaN
# 2021-01-06 60.0
# Name: price, dtype: float64
Note that by grouping by product before resampling and filling the empty slots, you can have different ranges (from min to max) for each product (I modified the data to showcase this).
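If you instead want every product to span the same global range, one option (a sketch under the same column names, not from the original answer) is to reindex each group against the overall date range after resampling:
full_dates = pd.date_range(df['date'].min(), df['date'].max(), freq='d', name='date')
(df.set_index('date')
   .groupby('product')['price']
   .apply(lambda s: s.resample('d').ffill().reindex(full_dates).ffill().bfill())
   .reset_index())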

Related

Replace nan values with data from previous months

I have a DataFrame as follows, containing NaN values. I want to replace the NaN values with the earlier non-NaN value from the same day of previous month(s):
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | nan
2022-02-02 | nan
2022-03-02 | nan
2022-04-02 | nan
...
2022-01-03 | nan
2022-02-03 | nan
2022-03-03 | nan
2022-04-03 | nan
Desired outcome
date (y-d-m) | value
2022-01-01 | 1
2022-02-01 | 2
2022-03-01 | 3
2022-04-01 | 4
...
2022-01-02 | 1
2022-02-02 | 2
2022-03-02 | 3
2022-04-02 | 4
...
2022-01-03 | 1
2022-02-03 | 2
2022-03-03 | 3
2022-04-03 | 4
Data:
import pandas as pd
from numpy import nan

df = pd.DataFrame(
    {'date (y-d-m)': ['2022-01-01', '2022-02-01', '2022-03-01', '2022-04-01',
                      '2022-01-02', '2022-02-02', '2022-03-02', '2022-04-02',
                      '2022-01-03', '2022-02-03', '2022-03-03', '2022-04-03'],
     'value': [1.0, 2.0, 3.0, 4.0, nan, nan, nan, nan, nan, nan, nan, nan]})
You could convert the "date (y-d-m)" column to datetime, then group by day and forward-fill with ffill (pulling values from the same day of previous months):
df['date (y-d-m)'] = pd.to_datetime(df['date (y-d-m)'], format='%Y-%d-%m')
df['value'] = df.groupby(df['date (y-d-m)'].dt.day)['value'].ffill()
Output:
date (y-d-m) value
0 2022-01-01 1.0
1 2022-01-02 2.0
2 2022-01-03 3.0
3 2022-01-04 4.0
4 2022-02-01 1.0
5 2022-02-02 2.0
6 2022-02-03 3.0
7 2022-02-04 4.0
8 2022-03-01 1.0
9 2022-03-02 2.0
10 2022-03-03 3.0
11 2022-03-04 4.0

Pandas fill missing dates and values simultaneously for each group

I have a dataframe (mydf) with dates for each group in monthly frequency like below:
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
I want to fill Dt for each group up to the maximum date in the column, starting from each Id's own first date, while filling the Sales column with 0. So each group starts at its own start date but ends at the same end date.
For example, Id=A starts at 2020-10-01 and runs all the way to 2021-06-01, and the value for the filled dates will be 0.
So the output will be
Dt Id Sales
2021-03-01 B 2
2021-04-01 B 42
2021-05-01 B 20
2021-06-01 B 4
2020-10-01 A 47
2020-11-01 A 67
2020-12-01 A 46
2021-01-01 A 0
2021-02-01 A 0
2021-03-01 A 0
2021-04-01 A 0
2021-05-01 A 0
2021-06-01 A 0
I have tried reindex, but instead of hard-coding the date range I want to use the dates in the groups.
My code is :
f = lambda x: x.reindex(pd.date_range('2020-10-01', '2021-06-01', freq='MS', name='Dt'))
mydf = mydf.set_index('Dt').groupby('Id').apply(f).drop('Id', axis=1).fillna(0)
mydf = mydf.reset_index()
Let's try:
1. Get the minimum date per group using groupby.min.
2. Add a column max to the aggregated mins, storing the overall maximum of Dt via Series.max.
3. Create an individual date_range per group based on those min and max values.
4. Series.explode into rows to get a frame representing the new index.
5. Build a MultiIndex.from_frame to reindex the DataFrame with.
6. reindex with midx and fill_value=0.
# Get min per group
dates = mydf.groupby('Id')['Dt'].min().to_frame(name='min')
# Get the overall max from the frame
dates['max'] = mydf['Dt'].max()
# Create a MultiIndex with a separate date range per group
midx = pd.MultiIndex.from_frame(
    dates.apply(lambda x: pd.date_range(x['min'], x['max'], freq='MS'), axis=1)
         .explode()
         .reset_index(name='Dt')[['Dt', 'Id']]
)
# Reindex
mydf = (mydf.set_index(['Dt', 'Id'])
            .reindex(midx, fill_value=0)
            .reset_index())
mydf:
Dt Id Sales
0 2020-10-01 A 47
1 2020-11-01 A 67
2 2020-12-01 A 46
3 2021-01-01 A 0
4 2021-02-01 A 0
5 2021-03-01 A 0
6 2021-04-01 A 0
7 2021-05-01 A 0
8 2021-06-01 A 0
9 2021-03-01 B 2
10 2021-04-01 B 42
11 2021-05-01 B 20
12 2021-06-01 B 4
DataFrame:
import pandas as pd

mydf = pd.DataFrame({
    'Dt': ['2021-03-01', '2021-04-01', '2021-05-01', '2021-06-01',
           '2020-10-01', '2020-11-01', '2020-12-01'],
    'Id': ['B', 'B', 'B', 'B', 'A', 'A', 'A'],
    'Sales': [2, 42, 20, 4, 47, 67, 46]
})
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
An alternative using pd.MultiIndex with a list comprehension:
s = pd.MultiIndex.from_tuples(
    [(x, d)
     for x, y in mydf.groupby("Id")["Dt"]
     for d in pd.date_range(min(y), max(mydf["Dt"]), freq="MS")],
    names=["Id", "Dt"])
print(mydf.set_index(["Id", "Dt"]).reindex(s, fill_value=0).reset_index())
Here is a different approach:
from itertools import product

# compute the min-max date range
date_range = pd.date_range(*mydf['Dt'].agg(['min', 'max']), freq='MS', name='Dt')
# make a MultiIndex per group, keeping only dates above each group's min date
idx = pd.MultiIndex.from_tuples(
    [e for Id, Dt_min in mydf.groupby('Id')['Dt'].min().items()
       for e in product(date_range[date_range > Dt_min], [Id])]
)
# concatenate the original dataframe and the missing indexes
mydf = mydf.set_index(['Dt', 'Id'])
mydf = (pd.concat([mydf, mydf.reindex(idx.difference(mydf.index)).fillna(0)])
          .sort_index(level=1)
          .reset_index())
mydf
output:
Dt Id Sales
0 2020-10-01 A 47.0
1 2020-11-01 A 67.0
2 2020-12-01 A 46.0
3 2021-01-01 A 0.0
4 2021-02-01 A 0.0
5 2021-03-01 A 0.0
6 2021-04-01 A 0.0
7 2021-05-01 A 0.0
8 2021-06-01 A 0.0
9 2021-03-01 B 2.0
10 2021-04-01 B 42.0
11 2021-05-01 B 20.0
12 2021-06-01 B 4.0
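Note (an addition, not part of the original answer): Sales comes back as float here because reindexing introduces NaN before fillna(0). If integers are required, cast back afterwards:
mydf['Sales'] = mydf['Sales'].astype(int)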
We can use the complete function from pyjanitor to expose the missing values:
Convert Dt to datetime:
mydf['Dt'] = pd.to_datetime(mydf['Dt'])
Create a mapping of Dt to new values, via pd.date_range, and set the frequency to monthly begin (MS):
max_time = mydf.Dt.max()
new_values = {"Dt": lambda df: pd.date_range(df.min(), max_time, freq='1MS')}
# pip install pyjanitor
import janitor
import pandas as pd
mydf.complete([new_values], by='Id').fillna(0)
Id Dt Sales
0 A 2020-10-01 47.0
1 A 2020-11-01 67.0
2 A 2020-12-01 46.0
3 A 2021-01-01 0.0
4 A 2021-02-01 0.0
5 A 2021-03-01 0.0
6 A 2021-04-01 0.0
7 A 2021-05-01 0.0
8 A 2021-06-01 0.0
9 B 2021-03-01 2.0
10 B 2021-04-01 42.0
11 B 2021-05-01 20.0
12 B 2021-06-01 4.0
Sticking to pandas only, we can combine apply with groupby and reindex; thankfully, Dt is unique within each Id, so we can safely reindex:
(mydf
 .set_index('Dt')
 .groupby('Id')
 .apply(lambda d: d.reindex(pd.date_range(d.index.min(), max_time, freq='1MS'),
                            fill_value=0))
 .drop(columns='Id')
 .rename_axis(['Id', 'Dt'])
 .reset_index())
Id Dt Sales
0 A 2020-10-01 47
1 A 2020-11-01 67
2 A 2020-12-01 46
3 A 2021-01-01 0
4 A 2021-02-01 0
5 A 2021-03-01 0
6 A 2021-04-01 0
7 A 2021-05-01 0
8 A 2021-06-01 0
9 B 2021-03-01 2
10 B 2021-04-01 42
11 B 2021-05-01 20
12 B 2021-06-01 4

Compare two dataframes based on column data in Python pandas

I have two dataframes, df1 and df2, and I would like to subtract df2 from df1, using a specific column, 'Code', for the row comparison.
import pandas as pd
import numpy as np

rng = pd.date_range('2021-01-01', periods=10, freq='D')
df1 = pd.DataFrame(index=rng,
                   data={'Val1': range(10),
                         'Val2': np.array(range(10)) * 5,
                         'Code': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
df2 = pd.DataFrame(data={'Code': [1, 2, 3, 4],
                         'Val1': [10, 5, 15, 20],
                         'Val2': [4, 8, 10, 7]})
df1:
Val1 Val2 Code
2021-01-01 0 0 1
2021-01-02 1 5 1
2021-01-03 2 10 1
2021-01-04 3 15 2
2021-01-05 4 20 2
2021-01-06 5 25 2
2021-01-07 6 30 3
2021-01-08 7 35 3
2021-01-09 8 40 3
2021-01-10 9 45 3
df2:
Code Val1 Val2
0 1 10 4
1 2 5 8
2 3 15 10
3 4 20 7
I am using the following code:
df = (df1.set_index(['Code']) - df2.set_index(['Code']))
and the result is:
      Val1  Val2
Code
1 -10.0 -4.0
1 -9.0 1.0
1 -8.0 6.0
2 -2.0 7.0
2 -1.0 12.0
2 0.0 17.0
3 -9.0 20.0
3 -8.0 25.0
3 -7.0 30.0
3 -6.0 35.0
4 NaN NaN
However, I only want results for the rows that are in df1, not for the missing keys (here, 4).
How do I do that and then set the index back to the original index from df1?
I tried something like this, but it doesn't work:
df = (df1.set_index(['Code']) - df2.set_index(['Code'])).set_index(df1['Code'])
I would also like to keep the column headers.
Desired output:
Val1 Val2 Code
Date
2021-01-01 -10.0 -4.0 1
2021-01-02 -9.0 1.0 1
2021-01-03 -8.0 6.0 1
2021-01-04 -2.0 7.0 2
2021-01-05 -1.0 12.0 2
2021-01-06 0.0 17.0 2
2021-01-07 -9.0 20.0 3
2021-01-08 -8.0 25.0 3
2021-01-09 -7.0 30.0 3
2021-01-10 -6.0 35.0 3
If you want results only for the rows that are in df1, dropping the missing keys (here, 4), just use the dropna() method:
df = (df1.set_index(['Code']) - df2.set_index(['Code'])).dropna()
then:
df.insert(0, 'Date', df1.index)
and finally:
df.reset_index(inplace=True)
df.set_index('Date', inplace=True)
Printing df now gives the desired output:
Code Val1 Val2
Date
2021-01-01 1 -10.0 -4.0
2021-01-02 1 -9.0 1.0
2021-01-03 1 -8.0 6.0
2021-01-04 2 -2.0 7.0
2021-01-05 2 -1.0 12.0
2021-01-06 2 0.0 17.0
2021-01-07 3 -9.0 20.0
2021-01-08 3 -8.0 25.0
2021-01-09 3 -7.0 30.0
2021-01-10 3 -6.0 35.0
You can use reindex to align df2 to df1["Code"]. Then take the underlying NumPy array and subtract it in place from the corresponding columns of df1. This leaves both the index and the "Code" column untouched and performs the subtraction as expected.
subtract_values = df2.set_index("Code").reindex(df1["Code"]).to_numpy()
df1[["Val1", "Val2"]] -= subtract_values
print(df1)
Val1 Val2 Code
2021-01-01 -10 -4 1
2021-01-02 -9 1 1
2021-01-03 -8 6 1
2021-01-04 -2 7 2
2021-01-05 -1 12 2
2021-01-06 0 17 2
2021-01-07 -9 20 3
2021-01-08 -8 25 3
2021-01-09 -7 30 3
2021-01-10 -6 35 3
If you don't want to modify df1, copy the data to a new DataFrame via new_df = df1.copy() and proceed with new_df instead of df1.

django- get all model instances with overlapping date ranges

Let's say I have the following model:
from django.db import models

class DateRange(models.Model):
    start = models.DateTimeField()
    end = models.DateTimeField()
Is there a way to get all pairs of DateRange instances with any overlap in their start to end range? For example, if I had:
id | start | end
-----+---------------------+---------------------
1 | 2020-01-02 12:00:00 | 2020-01-02 16:00:00 # overlap with 2, 3 and 5
2 | 2020-01-02 13:00:00 | 2020-01-02 14:00:00 # overlap with 1 and 3
3 | 2020-01-02 13:30:00 | 2020-01-02 17:00:00 # overlap with 1 and 2
4 | 2020-01-02 10:00:00 | 2020-01-02 10:30:00 # no overlap
5 | 2020-01-02 12:00:00 | 2020-01-02 12:30:00 # overlap with 1
I'd want:
id_1 | id_2
------+-----
1 | 2
1 | 3
1 | 5
2 | 3
Any thoughts on the best way to do this? The order of id_1 and id_2 doesn't matter, but I do need it to be distinct (e.g.- id_1=1, id_2=2 is the same as id_1=2, id_2=1 and should not be repeated)
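One straightforward approach (a sketch, not an answer from the original thread): since pairing a model with itself is awkward in the ORM, fetch the ranges once and compare them in Python. Two ranges overlap exactly when a.start < b.end and b.start < a.end, and ordering by id keeps each pair distinct:
from itertools import combinations

def overlapping_pairs():
    # Load all ranges once; with ids ascending, every emitted pair
    # satisfies id_1 < id_2, so no pair appears twice.
    ranges = list(DateRange.objects.order_by('id'))
    return [(a.id, b.id)
            for a, b in combinations(ranges, 2)
            if a.start < b.end and b.start < a.end]
This is O(n^2) in the number of rows; sorting by start and breaking out of the inner loop once b.start >= a.end would scale better for large tables.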

Pandas row-wise aggregation with multi-index

I have a pandas dataframe with three levels of row indexing; the last level is a datetime index. There are NaN values, and I am trying to fill each one with the average of its row. How can I do this?
data_df
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | nan
2019-01-28 18:00:00 | 2 | nan | 1
2019-01-28 19:00:00 | nan | nan | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | nan | nan | nan
Some rows may be all NaN; in that case I want to fill the row with 0's. Other rows may be fully populated, so imputing with the average isn't needed there.
I want the following result:
Level 0 | Level 1 | Level 2 |
A 123 2019-01-28 17:00:00 | 3 | 1 | 2
2019-01-28 18:00:00 | 2 | 1.5 | 1
2019-01-28 19:00:00 | 5 | 5 | 5
234 2019-01-28 05:00:00 | 1 | 1 | 3
2019-01-28 06:00:00 | 0 | 0 | 0
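For reproducibility, a frame matching the question's layout could be built like this (the value-column names a/b/c are an assumption, taken from the answer's printout below):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('A', 123, pd.Timestamp('2019-01-28 17:00:00')),
     ('A', 123, pd.Timestamp('2019-01-28 18:00:00')),
     ('A', 123, pd.Timestamp('2019-01-28 19:00:00')),
     ('A', 234, pd.Timestamp('2019-01-28 05:00:00')),
     ('A', 234, pd.Timestamp('2019-01-28 06:00:00'))],
    names=['Level 0', 'Level 1', 'Level 2'])
df = pd.DataFrame({'a': [3, 2, np.nan, 1, np.nan],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [np.nan, 1, 5, 3, np.nan]}, index=idx)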
Use DataFrame.mask with the row-wise mean, then convert the remaining all-NaN rows with DataFrame.fillna:
df = df.mask(df.isna(), df.mean(axis=1), axis=0).fillna(0)
print (df)
a b c
Level 0 Level 1 Level 2
A 123 2019-01-28 17:00:00 3.0 1.0 2.0
2019-01-28 18:00:00 2.0 1.5 1.0
2019-01-28 19:00:00 5.0 5.0 5.0
234 2019-01-28 05:00:00 1.0 1.0 3.0
2019-01-28 06:00:00 0.0 0.0 0.0
Another solution is to use DataFrame.fillna for the replacement, but because df.fillna(df.mean(axis=1), axis=1) is not implemented, a double transpose is necessary:
df = df.T.fillna(df.mean(axis=1)).fillna(0).T
