Fill missing dates - Python

I have a dataframe that contains temperature readings from different areas on different dates.
I want to add the missing dates for each area with a zero temperature.
For example:
import pandas as pd

df = pd.DataFrame({"area_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "reading_date": ["13/1/2017", "15/1/2017", "16/1/2017",
                                    "22/3/2017", "26/3/2017", "28/3/2017",
                                    "15/5/2017", "16/5/2017", "18/5/2017"],
                   "temp": [12, 15, 22, 6, 14, 8, 30, 25, 33]})
What is the most efficient way to fill the date gaps per area (with zeros), as shown below?
Many thanks.

Use:
First convert the reading_date column to datetime with to_datetime.
Then set_index to create a DatetimeIndex, and groupby with resample.
For a Series, add asfreq.
Replace the NaNs with fillna.
Last, add reset_index to turn the MultiIndex levels back into columns.
df['reading_date'] = pd.to_datetime(df['reading_date'], dayfirst=True)  # dayfirst: the dates are d/m/Y
df = (df.set_index('reading_date')
        .groupby('area_id')
        .resample('d')['temp']
        .asfreq()
        .fillna(0)
        .reset_index())
print(df)
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
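With newer pandas you can also skip the separate fillna step, because asfreq accepts a fill value directly. A minimal sketch of that variant, assuming pandas 0.20+ (where Resampler.asfreq gained fill_value):
df = (df.set_index('reading_date')
        .groupby('area_id')['temp']
        .resample('d')
        .asfreq(fill_value=0)  # fill only the newly created rows with 0
        .reset_index())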

Using reindex. Define a custom function to handle the reindexing operation, and call it inside groupby.apply.
def reindex(x):
    # Thanks to @jezrael for the improvement.
    return x.reindex(pd.date_range(x.index.min(), x.index.max()), fill_value=0)
Next, convert reading_date to datetime using pd.to_datetime:
df.reading_date = pd.to_datetime(df.reading_date, dayfirst=True)
Now, perform a groupby.
df = (
    df.set_index('reading_date')
      .groupby('area_id')
      .temp
      .apply(reindex)
      .reset_index()
)
df.columns = ['area_id', 'reading_date', 'temp']
df
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
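A small optional tweak, so the manual df.columns assignment is not needed: give the generated index a name inside the helper (the name reindex_dates below is mine):
def reindex_dates(x):
    # naming the index lets reset_index() restore the 'reading_date' column directly
    full_range = pd.date_range(x.index.min(), x.index.max(), name='reading_date')
    return x.reindex(full_range, fill_value=0)

df = (df.set_index('reading_date')
        .groupby('area_id')
        .temp
        .apply(reindex_dates)
        .reset_index())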

Related

Creating a difference of the data frame with 1 col in Python

I am new to Python and am trying to replicate in Python things that are done in Excel.
I want to take the difference of the columns (A, B, C) from the (fixed) mean:
Unnamed:0 A B C mean Mean diffA
0 2020-08-28 1 6 11 6.0 -5.0
1 2020-08-29 2 7 12 7.0 -5.0
2 2020-08-30 3 8 13 8.0 -5.0
3 2020-08-31 4 9 14 9.0 -5.0
4 2020-09-01 5 10 15 10.0 -5.0
One way is to manually put in the column name and find the difference, but is there a less manual way?
new_df['Mean diffA']=new_df['A']-new_df['mean']
You can subtract the mean from a range of columns:
diffs = new_df.loc[:, 'A':'C'].subtract(new_df['mean'], axis=0)
Then combine the differences and the original DataFrame:
new_df.join(diffs, rsuffix='_mean')
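For reference, a minimal end-to-end run, with the sample frame reconstructed from the question's numbers (the date column is omitted for brevity):
import pandas as pd

new_df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                       'B': [6, 7, 8, 9, 10],
                       'C': [11, 12, 13, 14, 15]})
new_df['mean'] = new_df.loc[:, 'A':'C'].mean(axis=1)
diffs = new_df.loc[:, 'A':'C'].subtract(new_df['mean'], axis=0)
new_df = new_df.join(diffs, rsuffix='_mean')  # adds A_mean, B_mean, C_mean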
You have a few options to do this. I tried the following and it worked.
import pandas as pd

dt = {'START DATE': ['2020-08-28', '2020-08-29', '2020-08-30',
                     '2020-08-31', '2020-09-01'],
      'A': [1, 2, 3, 4, 5],
      'B': [6, 7, 8, 9, 10],
      'C': [11, 12, 13, 14, 15]}
df = pd.DataFrame(dt)
df['Mean'] = df.loc[:,'A':'C'].mean(axis=1)
df[['dA','dB','dC']] = df.loc[:, 'A':'C'].subtract(df['Mean'], axis=0)
print(df)
Or you can do something like this as well:
df[['dA','dB','dC']] = df.loc[:,'A':'C'] - df[['Mean','Mean','Mean']].values
print(df)
Both of these will provide the same output:
START DATE A B C Mean dA dB dC
0 2020-08-28 1 6 11 6.0 -5.0 0.0 5.0
1 2020-08-29 2 7 12 7.0 -5.0 0.0 5.0
2 2020-08-30 3 8 13 8.0 -5.0 0.0 5.0
3 2020-08-31 4 9 14 9.0 -5.0 0.0 5.0
4 2020-09-01 5 10 15 10.0 -5.0 0.0 5.0
The second option is not the recommended approach; pandas provides the subtract method, so prefer that.
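If you would rather not hard-code the new column names either, the differences can carry derived names. A small sketch using add_prefix (my own variation, not from the question):
diffs = df.loc[:, 'A':'C'].subtract(df['Mean'], axis=0).add_prefix('d')
df = df.join(diffs)  # adds dA, dB, dC without spelling the names out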

Is there a way to apply a function to a MultiIndex dataframe slice with the same outer index without iterating each slice?

Basically, what I'm trying to accomplish is to fill the missing dates (creating new DataFrame rows) with respect to each product, then create a new column based on a cumulative sum of column 'A' (example shown below).
The data is in a MultiIndex DataFrame with (product, date) as the index.
Basically I would like to apply this answer to a MultiIndex DataFrame, using only the rightmost index and computing a subsequent np.cumsum for each product (and all dates).
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
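For reference, a minimal construction of this example frame (names taken from the question), which the answer below uses as df:
import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 2, 1, 4, 1, 7, 1, 4, 1, 1, 2]},
    index=pd.MultiIndex.from_tuples(
        [(0, d) for d in pd.to_datetime(['2017-01-02', '2017-01-03', '2017-01-04',
                                         '2017-01-05', '2017-01-06', '2017-01-07',
                                         '2017-01-10'])]
        + [(1, d) for d in pd.to_datetime(['2018-06-29', '2018-06-30', '2018-07-01',
                                           '2018-07-02', '2018-07-04'])],
        names=['product', 'date']))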
You have two ways.
One way: use groupby with apply, then resample and cumsum. Finally, pd.concat the result with df.A and fillna with 0:
s = (df.reset_index(0)
       .groupby('product')
       .apply(lambda x: x.resample(rule='D').asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Another way: you need two groupbys, the first for the resample and the second for the cumsum. Finally, use pd.concat and fillna with 0:
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
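On newer pandas the same result can also be produced in a single chain per group; a sketch under the question's names (CumSum spelled as in the desired output):
out = (df.groupby(level='product')
         .apply(lambda g: g.reset_index('product', drop=True)
                           .resample('D').asfreq(0)
                           .assign(CumSum=lambda d: d['A'].cumsum())))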

Better grouping of label frequency by month from dataframe

I have a dataframe with a date+time and a label, which I want to reshape into date (/month) columns with label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
Currently I can sort of divide the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?
You just need unstack here:
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0,fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output:
s.unstack(0,fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
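As a further shortening, pd.crosstab can build the same table straight from the raw columns; a sketch of that route (note the column labels come out as monthly periods such as 2017-09 rather than month-end dates):
pd.crosstab(df['label'], df['date_time'].dt.to_period('M'))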

Pandas - How to group sub columns of a dataframe?

I create the following dataframe:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 NaN
3 2015-01-02 1 4 NaN
4 2015-01-02 2 1 14
5 2015-01-02 2 2 15
6 2015-01-02 2 3 16
7 2015-01-03 1 1 17
8 2015-01-03 1 2 18
9 2015-01-03 1 3 NaN
10 2015-01-03 1 4 21
11 2015-01-03 2 1 20
12 2015-01-03 2 2 21
And then I group the subproducts by products:
df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
and I would like to get the following:
Value
ProductID 1 2
SubProductId 1 2 3 4 1 2 3
Date
2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
But when I print it, every column that starts with NaN gets pulled to the end:
Value
ProductID 1 2 1
SubProductId 1 2 1 2 3 4 3
Date
2015-01-02 11.0 12.0 14.0 15.0 16.0 NaN NaN
2015-01-03 17.0 18.0 20.0 21.0 NaN 21.0 NaN
How can I keep every subcolumn grouped under its corresponding column, even the subcolumns that contain NaN?
NB: Versions used:
Python version: 3.6.0
Pandas version: 0.19.2
If you want ordered column names, you can use sortlevel with axis=1 to sort the column index:
df1 = df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
# sort in descending order
df1.sortlevel(axis=1, ascending=False)
# Value
#ProductID 2 1
#SubProductId 3 2 1 4 3 2 1
#Date
#2015-01-02 16.0 15.0 14.0 NaN NaN 12.0 11.0
#2015-01-03 NaN 21.0 20.0 21.0 NaN 18.0 17.0
# sort in ascending order
df1.sortlevel(axis=1, ascending=True)
# Value
#ProductID 1 2
#SubProductId 1 2 3 4 1 2 3
#Date
#2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
#2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
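Note that sortlevel was deprecated in pandas 0.20 and later removed; on current versions the equivalent call is sort_index on the column axis:
# same result on pandas >= 0.20
df1 = df1.sort_index(axis=1, ascending=True)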

Removing incomplete seasons from multi-index dataframe (pandas)

Trying to apply the method from here to a multi-index dataframe doesn't seem to work.
Take a data-frame:
import pandas as pd
import numpy as np

dates = pd.date_range('20070101', periods=3200)
df = pd.DataFrame(data=np.random.randint(0, 100, (3200, 1)), columns=list('A'))
df.loc[5:13, 'A'] = np.nan  # add missing data points
df['date'] = dates
df = df[['date', 'A']]
Define a season function based on the month of each date:
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'
Apply the function
df['Season'] = df.apply(get_season, axis=1)
Create a 'Year' column for indexing
df['Year'] = df['date'].dt.year
Multi-index by Year and Season
df = df.set_index(['Year', 'Season'], inplace=False)
Count datapoints in each season
count = df.groupby(level=[0, 1]).count()
Drop the seasons with less than 75 days in them
count = count.drop(count[count.A < 75].index)
Create a variable for seasons with more than 75 days
complete = count[count['A'] >= 75].index
Using the isin function turns up False for everything, while I want it to select all the seasons that have more than 75 days of valid data in 'A':
df = df.isin(complete)
df
Every value comes up False, and I can't see why.
I hope this is concise enough; I need this to work on a multi-index using seasons, so I included it!
EDIT
Another method, based on multi-index reindexing (from here), is not working either; it also produces a blank dataframe:
df3 = df.reset_index().groupby('Year').apply(lambda x: x.set_index('Season').reindex(count, method='pad'))
EDIT 2
Also tried this
seasons = count[count['A'] >= 75].index
df = df[df['A'].isin(seasons)]
Again, blank output.
I think you can use Index.isin:
complete = count[count['A'] >= 75].index
idx = df.index.isin(complete)
print(idx)
[ True True True ..., False False False]
print(df[idx])
date A
Year Season
2007 1 2007-01-01 24.0
1 2007-01-02 92.0
1 2007-01-03 54.0
1 2007-01-04 91.0
1 2007-01-05 91.0
1 2007-01-06 NaN
1 2007-01-07 NaN
1 2007-01-08 NaN
1 2007-01-09 NaN
1 2007-01-10 NaN
1 2007-01-11 NaN
1 2007-01-12 NaN
1 2007-01-13 NaN
1 2007-01-14 NaN
1 2007-01-15 18.0
1 2007-01-16 82.0
1 2007-01-17 55.0
1 2007-01-18 64.0
1 2007-01-19 89.0
1 2007-01-20 37.0
1 2007-01-21 45.0
1 2007-01-22 4.0
1 2007-01-23 34.0
1 2007-01-24 35.0
1 2007-01-25 90.0
1 2007-01-26 17.0
1 2007-01-27 29.0
1 2007-01-28 58.0
1 2007-01-29 7.0
1 2007-01-30 57.0
... ... ...
2015 3 2015-08-02 42.0
3 2015-08-03 0.0
3 2015-08-04 31.0
3 2015-08-05 39.0
3 2015-08-06 25.0
3 2015-08-07 1.0
3 2015-08-08 7.0
3 2015-08-09 97.0
3 2015-08-10 38.0
3 2015-08-11 59.0
3 2015-08-12 28.0
3 2015-08-13 84.0
3 2015-08-14 43.0
3 2015-08-15 63.0
3 2015-08-16 68.0
3 2015-08-17 0.0
3 2015-08-18 19.0
3 2015-08-19 61.0
3 2015-08-20 11.0
3 2015-08-21 84.0
3 2015-08-22 75.0
3 2015-08-23 37.0
3 2015-08-24 40.0
3 2015-08-25 66.0
3 2015-08-26 50.0
3 2015-08-27 74.0
3 2015-08-28 37.0
3 2015-08-29 19.0
3 2015-08-30 25.0
3 2015-08-31 15.0
[3106 rows x 2 columns]
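As a side note, the count/drop bookkeeping can be skipped entirely with groupby.filter, which keeps only the groups satisfying a condition; a sketch on the same frame (df_complete is my own name):
# keep only (Year, Season) groups with at least 75 valid readings in 'A'
df_complete = df.groupby(level=['Year', 'Season']).filter(lambda g: g['A'].count() >= 75)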
