pandas pivot table for heatmap - python

I am trying to generate a heatmap using seaborn, however I am having a small problem with the formatting of my data.
Currently, my data is in the form:
Name  Diag  Date
A     1     2006-12-01
A     1     1994-02-12
A     2     2001-07-23
B     2     1999-09-12
B     1     2016-10-12
C     3     2010-01-20
C     2     1998-08-20
I would like to create a heatmap (preferably in Python) showing Name on one axis against Diag - whether it occurred or not. I have tried to pivot the table using pd.pivot, however I was given the error
ValueError: Index contains duplicate entries, cannot reshape
which came from:
piv = df.pivot(index='Name', columns='Diag')
The date itself is irrelevant, but I would like to show which Names have had which Diag, and which Diag combos cluster together. Do I need to create a new table for this, or is it possible with the data I have? In some cases a Name is not associated with every Diag.
EDIT:
I have since tried:
piv = df.pivot_table(index='Name', columns='Diag', values='Date', aggfunc='mean')
However, as Date is in datetime format, I end up with:
pandas.core.base.DataError: No numeric types to aggregate

You need pivot_table with some aggregate function, because the same index/column pair has multiple values and pivot needs unique values only:
print (df)
  Name  Diag  Time
0    A     1    12   <- duplicates for same (A, 1), different values
1    A     1    13   <- duplicates for same (A, 1), different values
2    A     2    14
3    B     2    18
4    B     1     1
5    C     3     9
6    C     2     8
df = df.pivot_table(index='Name', columns='Diag', values='Time', aggfunc='mean')
print (df)
Diag     1     2    3
Name
A     12.5  14.0  NaN
B      1.0  18.0  NaN
C      NaN   8.0  9.0
Alternative solution:
df = df.groupby(['Name','Diag'])['Time'].mean().unstack()
print (df)
Diag     1     2    3
Name
A     12.5  14.0  NaN
B      1.0  18.0  NaN
C      NaN   8.0  9.0
EDIT:
You can also check all the duplicates with duplicated:
df = df.loc[df.duplicated(['Name','Diag'], keep=False), ['Name','Diag']]
print (df)
  Name  Diag
0    A     1
1    A     1
EDIT:
Taking the mean of datetimes is not easy - you need to convert the dates to nanoseconds, take the mean, and then convert back to datetimes. There is also another problem - NaN has to be replaced by some scalar, e.g. 0, which is converted to the zero datetime, 1970-01-01.
import numpy as np

df.Date = pd.to_datetime(df.Date)
df['dates_in_ns'] = pd.Series(df.Date.values.astype(np.int64), index=df.index)
df = df.pivot_table(index='Name',
                    columns='Diag',
                    values='dates_in_ns',
                    aggfunc='mean',
                    fill_value=0)
df = df.apply(pd.to_datetime)
print (df)
Diag                   1           2           3
Name
A    2000-07-07 12:00:00  2001-07-23  1970-01-01
B    2016-10-12 00:00:00  1999-09-12  1970-01-01
C    1970-01-01 00:00:00  1998-08-20  2010-01-20
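Since the final goal is a heatmap of which Name had which Diag, here is a minimal plotting sketch. It assumes seaborn and matplotlib are installed, and df_raw is a hypothetical name for the original long-format frame (Name, Diag, Date) from the question; the 0/1 presence encoding is also my assumption, not something stated in the question:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# df_raw: the original long-format frame with Name, Diag, Date columns
# count occurrences per (Name, Diag) pair, then clip to 0/1 for pure presence
presence = pd.crosstab(df_raw['Name'], df_raw['Diag']).clip(upper=1)
sns.heatmap(presence, annot=True, cbar=False)
plt.show()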

Expanding multi-indexed dataframe with new dates as forecast

Suppose I now have a df multi-indexed on both the date (df['DT']) and the ID (df['ID']):
DT,ID,value1,value2
2020-10-01,a,1,1
2020-10-01,b,2,1
2020-10-01,c,3,1
2020-10-01,d,4,1
2020-10-02,a,10,1
2020-10-02,b,11,1
2020-10-02,c,12,1
2020-10-02,d,13,1
df = df.set_index(['DT','ID'])
And now I want to expand the df to include '2020-10-03' and '2020-10-04' with the same set of IDs {a, b, c, d} as my forecast period. To forecast value1, I assume it takes the average of the existing values, e.g. for a's value1 on both '2020-10-03' and '2020-10-04' I assume it will take (1 + 10) / 2 = 5.5. For value2, I assume it stays constant at 1.
The expected df will look like this:
DT,ID,value1,value2
2020-10-01,a,1.0,1
2020-10-01,b,2.0,1
2020-10-01,c,3.0,1
2020-10-01,d,4.0,1
2020-10-02,a,10.0,1
2020-10-02,b,11.0,1
2020-10-02,c,12.0,1
2020-10-02,d,13.0,1
2020-10-03,a,5.5,1
2020-10-03,b,6.5,1
2020-10-03,c,7.5,1
2020-10-03,d,8.5,1
2020-10-04,a,5.5,1
2020-10-04,b,6.5,1
2020-10-04,c,7.5,1
2020-10-04,d,8.5,1
Appreciate your help and time.
For an easy forecast with the mean, use DataFrame.unstack to get a DatetimeIndex, add the next datetimes with DataFrame.reindex and date_range, then replace the missing values in the value1 level with DataFrame.fillna and set value2 to 1, and last reshape back with DataFrame.stack:
print (df)
               value1  value2
DT         ID
2020-10-01 a        1       1
           b        2       1
           c        3       1
           d        4       1
2020-10-02 a       10       1
           b       11       1
           c       12       1
           d       13       1
rng = pd.date_range('2020-10-01','2020-10-04', name='DT')
df1 = df.unstack().reindex(rng)
df1['value1'] = df1['value1'].fillna(df1['value1'].mean())
df1['value2'] = 1
df2 = df1.stack()
print (df2)
               value1  value2
DT         ID
2020-10-01 a      1.0       1
           b      2.0       1
           c      3.0       1
           d      4.0       1
2020-10-02 a     10.0       1
           b     11.0       1
           c     12.0       1
           d     13.0       1
2020-10-03 a      5.5       1
           b      6.5       1
           c      7.5       1
           d      8.5       1
2020-10-04 a      5.5       1
           b      6.5       1
           c      7.5       1
           d      8.5       1
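An alternative sketch that builds the full (date, ID) grid directly with MultiIndex.from_product instead of the unstack/stack round-trip; this is my variation, not the original answer, and it assumes the DT index level already holds datetimes (e.g. via parse_dates when reading the data):

import pandas as pd

# full forecast grid: every date crossed with every ID
rng = pd.date_range('2020-10-01', '2020-10-04', name='DT')
ids = df.index.get_level_values('ID').unique()
grid = pd.MultiIndex.from_product([rng, ids], names=['DT', 'ID'])

df2 = df.reindex(grid)
# fill value1 with each ID's historical mean, value2 with the constant 1
df2['value1'] = df2['value1'].fillna(df2.groupby(level='ID')['value1'].transform('mean'))
df2['value2'] = df2['value2'].fillna(1).astype(int)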
But forecasting in general is more complex; you can check this.

Aggregation in pandas dataframe with columns names in one row

I am using Python 3.6 and I am doing an aggregation, which I have done correctly, but the column names are not in the form I want.
df = pd.DataFrame({'ID':[1,1,2,2,2],
                   'revenue':[1,3,5,1,5],
                   'month':['2012-01-01','2012-01-01','2012-03-01','2014-01-01','2012-01-01']})
print(df)
   ID       month  revenue
0   1  2012-01-01        1
1   1  2012-01-01        3
2   2  2012-03-01        5
3   2  2014-01-01        1
4   2  2012-01-01        5
Doing the aggregation below.
df = df.groupby(['ID']).agg({'revenue':'sum','month':[('distinct_m','nunique'),('month_m','first')]}).reset_index()
print(df)
  ID revenue      month
         sum distinct_m     month_m
0  1       4          1  2012-01-01
1  2      11          3  2012-03-01
Desired output is:
   ID  revenue  distinct_m       month
0   1        4           1  2012-01-01
1   2       11           3  2012-03-01
The problem is that I am using a mixed form of expressions inside agg(). Had it been only agg({'revenue':'sum'}), I would have got a column named revenue in precisely the same format I wanted, as shown below:
   ID  revenue
0   1        4
1   2       11
But since I am also creating 2 additional columns using the tuple form ('distinct_m','nunique'), ('month_m','first'), I get column names spread across two rows.
Is there a way to get the desired output shown above in one aggregation agg()? I want to avoid using the tuple form for 'revenue':'sum', and I am not looking for multiple renaming operations afterwards to get the column names right. I am using Python 3.6.
To avoid this problem, use named aggregation, which works in pandas 0.25+ and lets you specify each column name:
df = (df.groupby(['ID']).agg(revenue=('revenue','sum'),
                             distinct_m=('month','nunique'),
                             month_m=('month','first')
                             ).reset_index())
print(df)
   ID  revenue  distinct_m     month_m
0   1        4           1  2012-01-01
1   2       11           3  2012-03-01
For older pandas versions, it is possible to flatten the MultiIndex columns and then rename:
df = df.groupby(['ID']).agg({'revenue':'sum',
                             'month':[('distinct_m','nunique'),('month_m','first')]})
df.columns = df.columns.map('_'.join)
df = df.rename(columns={'revenue_sum':'revenue',
                        'month_distinct_m':'distinct_m',
                        'month_month_m':'month_m'})
df = df.reset_index()
print(df)
   ID  revenue  distinct_m     month_m
0   1        4           1  2012-01-01
1   2       11           3  2012-03-01
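The keyword form above is shorthand for pd.NamedAgg; an equivalent, more explicit spelling (same pandas 0.25+ requirement, starting again from the original df) would be:

df = (df.groupby('ID')
        .agg(revenue=pd.NamedAgg(column='revenue', aggfunc='sum'),
             distinct_m=pd.NamedAgg(column='month', aggfunc='nunique'),
             month_m=pd.NamedAgg(column='month', aggfunc='first'))
        .reset_index())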

Pandas: Sum multiple columns, but write NaN if any column in that row is NaN or 0

I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns are blank (NaN or 0), I need the new column to also be written as blank (NaN)
   a    b  c  d  sum
   3    5  7  4   19
   2    6  0  2  NaN   (note the 0 in column c)
   4  NaN  3  7  NaN
I am currently using DataFrame.sum, formatted like this:
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs but does not write NaN to the sum column.
Thanks in advance for any advice
Replace your 0s with np.nan, then pass skipna=False:
import numpy as np

df.replace(0, np.nan).sum(axis=1, skipna=False)

0    19.0
1     NaN
2     NaN
dtype: float64

df['sum'] = df.replace(0, np.nan).sum(axis=1, skipna=False)
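An alternative sketch that computes the invalid-row mask explicitly instead of round-tripping 0 through NaN; this is my variation, not part of the answer above:

cols = df[['a', 'b', 'c', 'd']]
# NaN out the sum wherever any source value is missing or zero
bad = cols.isna().any(axis=1) | cols.eq(0).any(axis=1)
df['sum'] = cols.sum(axis=1).mask(bad)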

Add pandas Series as new columns to a specific Dataframe row

Say I have a Dataframe
df = pd.DataFrame({'A':[0,1],'B':[2,3]})
   A  B
0  0  2
1  1  3
Then I have a Series generated by some other function using inputs from the first row of the df but which has no overlap with the existing df
s = pd.Series({'C':4,'D':6})
C    4
D    6
Now I want to add s to df.loc[0] with the keys becoming new columns and the values added only to this first row. The end result for df should look like:
   A  B  C    D
0  0  2  4    6
1  1  3  NaN  NaN
How would I do that? Similar questions I've found only seem to look at doing this for one column or just adding the Series as a new row at the end of the DataFrame but not updating an existing row by adding multiple new columns from a Series.
I've tried df.loc[0,list(['C','D'])] = [4,6] which was suggested in another answer but that only works if ['C','D'] are already existing columns in the Dataframe. df.assign(**s) works but then assigns the Series values to all rows.
Use join with the transpose:
df.join(pd.DataFrame(s).T)

   A  B    C    D
0  0  2  4.0  6.0
1  1  3  NaN  NaN

Or use concat:

pd.concat([df, pd.DataFrame(s).T], axis=1)

   A  B    C    D
0  0  2  4.0  6.0
1  1  3  NaN  NaN
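Another sketch, relying on the fact that .loc enlarges the DataFrame when given a new column label; this is my variation, and note the untouched rows become NaN, so the new columns are upcast to float:

for col, val in s.items():
    df.loc[0, col] = val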

How to interpolate grouped time series in a Pandas dataframe

I have data in type pd.DataFrame which looks like the following:
type  date   sum
A     Jan-1    1
A     Jan-3    2
B     Feb-1    1
B     Feb-2    3
B     Feb-5    6
The task is to build a continuous time series for each type (the missing date should be filled with 0).
The expected result is:
type  date   sum
A     Jan-1    1
A     Jan-2    0
A     Jan-3    2
B     Feb-1    1
B     Feb-2    3
B     Feb-3    0
B     Feb-4    0
B     Feb-5    6
Is it possible to do that with pandas or other Python tools?
The real dataset has millions of rows.
You first must change your date column to datetime and put it in the index to take advantage of resampling; then you can convert the dates back to their original format. (Parsing with format="%b-%d" defaults the year to 1900, which does not matter here since the year is only used to build the daily range.)
# change to datetime
df['date'] = pd.to_datetime(df.date, format="%b-%d")
df = df.set_index('date')

# resample daily to fill in missing dates
df1 = df.groupby('type').resample('d')['sum'].asfreq().fillna(0)
df1 = df1.reset_index()

# change back to the original date format
df1['date'] = df1.date.dt.strftime('%b-%d')
Output:

  type    date  sum
0    A  Jan-01  1.0
1    A  Jan-02  0.0
2    A  Jan-03  2.0
3    B  Feb-01  1.0
4    B  Feb-02  3.0
5    B  Feb-03  0.0
6    B  Feb-04  0.0
7    B  Feb-05  6.0
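One optional cleanup: fillna(0) promotes the sum column to float, so if the sums are integral (an assumption about the data), you can cast back after filling:

# restore integer dtype, assuming all sums are whole numbers
df1['sum'] = df1['sum'].astype(int)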
