Note: I have followed Stackoverflow's instruction of how to create MRE and paste the MRE into 'code block' as instructed (i.e. paste it in the Body and then press Ctrl+K when highlighting it). If I am still not doing it correctly, let me know.
Back to question: Suppose I now have a df multi-indexed in both the date (df['DT']) and ID (df['ID'])
DT,ID,value1,value2
2020-10-01,a,1,1
2020-10-01,b,2,1
2020-10-01,c,3,1
2020-10-01,d,4,1
2020-10-02,a,10,1
2020-10-02,b,11,1
2020-10-02,c,12,1
2020-10-02,d,13,1
df = df.set_index(['DT','ID'])
And now, I want to expand the df to have '2020-10-03' and '2020-10-04' with the same set of ID {a,b,c,d} as my forecast period. To forecast value 1, I assume they will take the average of the existing values, e.g. for a's value1 in both 2020-10-03' and '2020-10-04', I assume it will take (1+10)/2 = 5.5. For value 2, I assume it will stay constant as 1.
The expected df will look like this:
DT,ID,value1,value2
2020-10-01,a,1.0,1
2020-10-01,b,2.0,1
2020-10-01,c,3.0,1
2020-10-01,d,4.0,1
2020-10-02,a,10.0,1
2020-10-02,b,11.0,1
2020-10-02,c,12.0,1
2020-10-02,d,13.0,1
2020-10-03,a,5.5,1
2020-10-03,b,6.5,1
2020-10-03,c,7.5,1
2020-10-03,d,8.5,1
2020-10-04,a,5.5,1
2020-10-04,b,6.5,1
2020-10-04,c,7.5,1
2020-10-04,d,8.5,1
Appreciate your help and time.
For easy forecast with mean use DataFrame.unstack for DatetimeIndex, add next datetimes by DataFrame.reindex with date_range and then replace missing values in value1 level by DataFrame.fillna and for value2 is set 1, last reshape back by DataFrame.stack:
print (df)
value1 value2
DT ID
2020-10-01 a 1 1
b 2 1
c 3 1
d 4 1
2020-10-02 a 10 1
b 11 1
c 12 1
d 13 1
rng = pd.date_range('2020-10-01','2020-10-04', name='DT')
df1 = df.unstack().reindex(rng)
df1['value1'] = df1['value1'].fillna(df1['value1'].mean())
df1['value2'] = 1
df2 = df1.stack()
print (df2)
value1 value2
DT ID
2020-10-01 a 1.0 1
b 2.0 1
c 3.0 1
d 4.0 1
2020-10-02 a 10.0 1
b 11.0 1
c 12.0 1
d 13.0 1
2020-10-03 a 5.5 1
b 6.5 1
c 7.5 1
d 8.5 1
2020-10-04 a 5.5 1
b 6.5 1
c 7.5 1
d 8.5 1
But forecasting is more complex, you can check this
I am using Python 3.6 and I am doing an aggregation, which I have done correctly, but the column names are not in the form I want.
df = pd.DataFrame({'ID':[1,1,2,2,2],
'revenue':[1,3,5,1,5],
'month':['2012-01-01','2012-01-01','2012-03-01','2014-01-01','2012-01-01']})
print(df)
ID month revenue
0 1 2012-01-01 1
1 1 2012-01-01 3
2 2 2012-03-01 5
3 2 2014-01-01 1
4 2 2012-01-01 5
Doing the aggregation below.
df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
ID revenue month
sum distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
Desired output is:
ID revenue distinct_m month
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg('revenue':'sum'), I would have got a column named revenue in precisely the same format I wanted, as shown below:
ID revenue
0 1 4
1 2 11
But, since I am creating 2 additional columns as well, using tuple form ('distinct_m','nunique'),('month_m','first'), I get column names spread across two rows.
Is there a way to get the desired output shown above in one aggregation agg()? I want to avoid using tuple form for 'revenue':'sum'. I am not looking for multiple operations afterwards to get the column names right. I am using Python 3.6.
For avoid this problem is used named aggregations working in pandas 0.25+, where is possible specify each columns names:
df = (df.groupby(['ID']).agg(revenue=('revenue','sum'),
distinct_m=('month','nunique'),
month_m = ('month','first')
).reset_index())
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
For lower pandas versions is possible flatten columns in MultiIndex and then rename:
df = df.groupby(['ID']).agg({'revenue':'sum',
'month':[('distinct_m','nunique'),('month_m','first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum':'revenue',
'month_distinct_m':'distinct_m',
'month_month_m':'month_m'})
df = df.reset_index()
print(df)
ID revenue distinct_m month_m
0 1 4 1 2012-01-01
1 2 11 3 2012-03-01
I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns are blank (NaN or 0), I need the new column to also be written as blank (NaN)
a b c d sum
3 5 7 4 19
2 6 0 2 NaN (note the 0 in column c)
4 NaN 3 7 NaN
I am currently using the pd.sum function, formatted like this
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice
replace your 0 to np.nan then pass skipna = False
df.replace(0,np.nan).sum(1,skipna=False)
0 19.0
1 NaN
2 NaN
dtype: float64
df['sum'] = df.replace(0,np.nan).sum(1,skipna=False)
Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
A B
0 0 2
1 1 3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series ({'C':4,'D':6})
C 4
D 6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
A B C D
0 0 2 4 6
1 1 3 NaN NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
join with transpose:
df.join(pd.DataFrame(s).T)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
Or use concat
pd.concat([df, pd.DataFrame(s).T], axis=1)
A B C D
0 0 2 4.0 6.0
1 1 3 NaN NaN
I have data in type pd.DataFrame which looks like the following:
type date sum
A Jan-1 1
A Jan-3 2
B Feb-1 1
B Feb-2 3
B Feb-5 6
The task is to build a continuous time series for each type (the missing date should be filled with 0).
The expected result is:
type date sum
A Jan-1 1
A Jan-2 0
A Jan-3 2
B Feb-1 1
B Feb-2 3
B Feb-3 0
B Feb-4 0
B Feb-5 6
Is it possible to do that with pandas or other Python tools?
The real dataset has millions of rows.
You first must change your date to a datetime and put that column in the index to take advantage of resampling and then you can convert the date back to its original format
# change to datetime
df['date'] =pd.to_datetime(df.date, format="%b-%d")
df = df.set_index('date')
# resample to fill in missing dates
df1 = df.groupby('type').resample('d')['sum'].asfreq().fillna(0)
df1 = df1.reset_index()
# change back to original date format
df1['date'] = df1.date.dt.strftime('%b-%d')
output
type date sum
0 A Jan-01 1.0
1 A Jan-02 0.0
2 A Jan-03 2.0
3 B Feb-01 1.0
4 B Feb-02 3.0
5 B Feb-03 0.0
6 B Feb-04 0.0
7 B Feb-05 6.0