better grouping of label frequency by month from dataframe - python

I have a dataframe with a date+time and a label, which I want to reshape into date (/month) columns with label frequencies for that month:
date_time label
1 2017-09-26 17:08:00 0
3 2017-10-03 13:27:00 2
4 2017-10-04 19:04:00 0
11 2017-10-11 18:28:00 1
27 2017-10-13 11:22:00 0
28 2017-10-13 21:43:00 0
39 2017-10-16 14:43:00 0
40 2017-10-16 21:39:00 0
65 2017-10-21 21:53:00 2
...
98 2017-11-01 20:08:00 3
99 2017-11-02 12:00:00 3
100 2017-11-02 12:01:00 2
109 2017-11-02 12:03:00 3
110 2017-11-03 22:24:00 0
111 2017-11-04 09:05:00 3
112 2017-11-06 12:36:00 3
113 2017-11-06 12:48:00 2
128 2017-11-07 15:20:00 2
143 2017-11-10 16:36:00 3
144 2017-11-10 20:00:00 0
145 2017-11-10 20:02:00 0
I group the label frequency by month with this line (thanks partially to this post):
df2 = df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count()
which outputs
date_time label
2017-09-30 0 1
2017-10-31 0 6
1 1
2 8
3 2
2017-11-30 0 25
4 2
5 1
2 4
3 11
2017-12-31 0 14
5 3
2 5
3 7
2018-01-31 0 8
4 1
5 1
2 2
3 3
but, as mentioned before, I would like to get the data by month/date columns:
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
currently I can sort of split the data with
pd.concat([df2[m] for m in df2.index.levels[0]], axis=1).fillna(0)
but I lose the column names:
label label label label label
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
So I have to do a longer version where I generate a series, rename it, concatenate and then fill in the blanks:
m_list = []
for m in df2.index.levels[0]:
    m_labels = df2[m]
    m_labels = m_labels.rename(m)
    m_list.append(m_labels)
pd.concat(m_list, axis=1).fillna(0)
resulting in
2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
0 1.0 6.0 25.0 14.0 8.0
1 0.0 1.0 0.0 0.0 0.0
2 0.0 8.0 4.0 5.0 2.0
3 0.0 2.0 11.0 7.0 3.0
4 0.0 0.0 2.0 0.0 1.0
5 0.0 0.0 1.0 3.0 1.0
Is there a shorter/more elegant way to get to this last dataframe from my original one?

You just need unstack here:
df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label'].count().unstack(0, fill_value=0)
Out[235]:
date_time 2017-09-30 2017-10-31 2017-11-30
label
0 1 5 3
1 0 1 0
2 0 2 3
3 0 0 6
Based on your groupby output (the series shown above, here called s):
s.unstack(0, fill_value=0)
Out[240]:
date_time 2017-09-30 2017-10-31 2017-11-30 2017-12-31 2018-01-31
label
0 1 6 25 14 8
1 0 1 0 0 0
2 0 8 4 5 2
3 0 2 11 7 3
4 0 0 2 0 1
5 0 0 1 3 1
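For a quick self-contained check, here is a minimal sketch of the same unstack approach with a tiny frame built inline (column names assumed to match the question; recent pandas versions may prefer freq='ME' over freq='M'):
import pandas as pd

df = pd.DataFrame({
    'date_time': pd.to_datetime(['2017-09-26 17:08', '2017-10-03 13:27',
                                 '2017-10-04 19:04', '2017-11-01 20:08']),
    'label': [0, 2, 0, 3],
})

# count labels per month-end period, then move the months into the columns
out = (df.groupby([pd.Grouper(key='date_time', freq='M'), 'label'])['label']
         .count()
         .unstack(0, fill_value=0))
print(out)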

Related

Pandas - new column based on another row

I want to create a new column in my dataframe with the value of another row.
DataFrame
TimeStamp Event Value
0 1603822620000 1 102.0
1 1603822680000 1 108.0
2 1603822740000 1 107.0
3 1603822800000 2 1
4 1603823040000 1 106.0
5 1603823100000 2 0
6 1603823160000 2 1
7 1603823220000 1 105.0
I would like to add a new column with the previous value where Event == 1.
TimeStamp Event Value PrevValue
0 1603822620000 1 102.0 NaN
1 1603822680000 1 108.0 102.0
2 1603822740000 1 107.0 108.0
3 1603822800000 2 1 107.0
4 1603823040000 1 106.0 107.0
5 1603823100000 2 0 106.0
6 1603823160000 2 1 106.0
7 1603823220000 1 105.0 106.0
So I can't simply use shift(1), nor groupby('Event').shift(1).
Current solution
df["PrevValue"] =df.timestamp.apply(lambda ts: (df[(df.Event == 1) & (df.timestamp < ts)].iloc[-1].value))
But I guess, that's not the best solution.
Is there something like shiftUntilCondition(condition)?
Thanks a lot!
Try with
df['new'] = df['Value'].where(df['Event']==1).ffill().shift()
Out[83]:
0 NaN
1 102.0
2 108.0
3 107.0
4 107.0
5 106.0
6 106.0
7 106.0
Name: Value, dtype: float64
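As a self-contained sketch (using the PrevValue column name from the question and a shortened copy of the sample data), the same idea reads:
import pandas as pd

df = pd.DataFrame({
    'TimeStamp': [1603822620000, 1603822680000, 1603822740000,
                  1603822800000, 1603823040000],
    'Event': [1, 1, 1, 2, 1],
    'Value': [102.0, 108.0, 107.0, 1, 106.0],
})

# keep Value only on Event == 1 rows, carry it forward, then shift one row
# so each row sees the previous Event == 1 value
df['PrevValue'] = df['Value'].where(df['Event'] == 1).ffill().shift()
print(df)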

pandas Dataframe: Subtract a groupby mean of subset data from the full original data

I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame, data, whose index is a datetime object (monthly, say 100 years = 100 yr * 12 mo), with 10 columns of station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of the above data, e.g. the first 50 years (i.e., 50 yr * 12 mo),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month and each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
both of which seem to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate a monthly mean from a subset of the original data (e.g., using only 50 years), and subtract that monthly mean from the original data month by month.
Subtracting was easy when I used the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do when the mean is calculated from a [subset] of the data?
Could you share your insight for this problem? Thank you in advance!
@mozway
The input (and also the output) has the shape shown in the image linked in the original post ("Input shape with random values"); only the output values are anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching values with NaN using DataFrame.where, so that after GroupBy.transform the result has the same index as the original DataFrame and can be subtracted directly:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
Then merge mean_sub onto this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
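A more compact variant (just a sketch, assuming data_org still holds only the value columns, i.e. before the helper m column is added): look up each row's month in mean_sub, restore the original DatetimeIndex, and subtract.
# mean_sub is indexed by month number (1, 4, 7, 10 in this example)
aligned = mean_sub.reindex(data_org.index.month).set_axis(data_org.index)
df = data_org - aligned
print(df)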

Sum of dataframe columns to another dataframe column Python gives NaN

I want to sum the rows and columns of two dataframes (pdf and wdf) and save the results in the columns of another dataframe (to_hex).
I tried it for one dataframe and it worked. It doesn't work for the other (it gives NaN). I cannot understand what the difference is.
to_hex = pd.DataFrame(0, index=np.arange(len(sasiedztwo)), columns=['ID','podroze','p_rozmyte'])
to_hex.loc[:,'ID']= wdf.index+1
to_hex.index=pdf.index
to_hex.loc[:,'podroze']= pd.DataFrame(pdf.sum(axis=0))[:]
to_hex.index=wdf.index
to_hex.loc[:,'p_rozmyte']= pd.DataFrame(wdf.sum(axis=0))[:]
This is what the pdf dataframe looks like:
0 1 2 3 4 5 6 7 8
0 0 0 10 0 0 0 0 0 100
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0 1000
8 0 0 0 0 0 0 0 0 0
This is wdf:
0 1 2 3 4 5 6 7 8
0 2.5 5.0 35.0 0.0 27.5 55.0 25.0 50.0 102.5
1 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
2 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 25.0
3 0.0 0.0 5.0 0.0 0.0 0.0 0.0 0.0 300.0
4 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 250.0
6 0.0 0.0 2.5 0.0 0.0 0.0 0.0 0.0 525.0
7 0.0 0.0 250.0 0.0 250.0 500.0 250.0 500.0 1000.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 500.0
And this is the result in to_hex:
ID podroze p_rozmyte
0 1 0 NaN
1 2 0 NaN
2 3 10 NaN
3 4 0 NaN
4 5 0 NaN
5 6 0 NaN
6 7 0 NaN
7 8 0 NaN
8 9 1100 NaN
SOLUTION:
One option to solve it is to modify your code as follows:
to_hex.loc[:,'ID']= wdf.index+1
# to_hex.index=pdf.index # no need
to_hex.loc[:,'podroze']= pdf.sum(axis=0) # modified; directly use the series output from SUM()
# to_hex.index=wdf.index # no need
to_hex.loc[:,'p_rozmyte']= wdf.sum(axis=0) # modified
Then you get:
ID podroze p_rozmyte
0 1 0 2.5
1 2 0 5.0
2 3 10 302.5
3 4 0 0.0
4 5 0 277.5
5 6 0 555.0
6 7 0 275.0
7 8 0 550.0
8 9 1100 3527.5
I think the reason that you get NaN for one case and correct values for the other case lies in to_hex.dtypes:
ID int64
podroze int64
p_rozmyte int64
dtype: object
As you can see, the to_hex dataframe's columns are int64. That is fine when you assign the pdf sums (since they have the same dtype):
pd.DataFrame(pdf.sum(axis=0))[:].dtypes
0 int64
dtype: object
but it does not work when you assign the wdf sums:
pd.DataFrame(wdf.sum(axis=0))[:].dtypes
0 float64
dtype: object
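An alternative sketch (not from the answer above): build to_hex in one step from the column sums, converting them to plain arrays so that neither dtype nor index alignment gets in the way. Here pdf and wdf are assumed to be the frames from the question, and the result keeps a default RangeIndex.
import numpy as np
import pandas as pd

to_hex = pd.DataFrame({
    'ID': np.arange(len(pdf.columns)) + 1,     # 1..n, mirroring wdf.index + 1 in the question
    'podroze': pdf.sum(axis=0).to_numpy(),     # column sums of pdf
    'p_rozmyte': wdf.sum(axis=0).to_numpy(),   # column sums of wdf
})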

Is there a way to apply a function to a MultiIndex dataframe slice with the same outer index without iterating each slice?

Basically, what I'm trying to accomplish is to fill the missing dates (creating new DataFrame rows) with respect to each product, then create a new column based on a cumulative sum of column 'A' (example shown below).
The data is in a MultiIndex DataFrame with (product, date) as the index levels.
Basically I would like to apply this answer to a MultiIndex DataFrame using only the rightmost index and calculating a subsequent np.cumsum for each product (and all dates).
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
You have two ways:
One way:
Use groupby with apply, together with resample and cumsum. Finally, pd.concat the result with df.A and fillna with 0:
s = (df.reset_index(0)
       .groupby('product')
       .apply(lambda x: x.resample(rule='D').asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Another way:
You need two groupbys: the first for the resample, the second for the cumsum. Finally, use pd.concat and fillna with 0:
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
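A compact variant of the second way (a sketch, assuming the date level is a DatetimeIndex): keep the 0-filled values from the resample as the A column itself, then take the cumulative sum within each product, which avoids the final concat.
s1 = df.reset_index('product').groupby('product').resample('D').asfreq(0)['A']
out = s1.to_frame()                 # A already holds 0 for the inserted dates
out['CumSum'] = out.groupby(level='product')['A'].cumsum()
print(out)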

Fill missing dates

I have a dataframe containing temperature readings from different areas on different dates.
I want to add the missing dates for each area with zero temperature,
for example:
df = pd.DataFrame({"area_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "reading_date": ["13/1/2017", "15/1/2017", "16/1/2017",
                                    "22/3/2017", "26/3/2017", "28/3/2017",
                                    "15/5/2017", "16/5/2017", "18/5/2017"],
                   "temp": [12, 15, 22, 6, 14, 8, 30, 25, 33]})
What is the most efficient way to fill the date gaps per area (with zeros), as shown below?
Many Thanks.
Use:
first convert the reading_date column to datetime with to_datetime
use set_index to get a DatetimeIndex, then groupby with resample
since selecting ['temp'] gives a Series, add asfreq
replace the NaNs with fillna
finally add reset_index to turn the MultiIndex back into columns
df['reading_date'] = pd.to_datetime(df['reading_date'])
df = (df.set_index('reading_date')
        .groupby('area_id')
        .resample('d')['temp']
        .asfreq()
        .fillna(0)
        .reset_index())
print (df)
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
Using reindex. Define a custom function to handle the reindexing operation, and call it inside groupby.apply.
def reindex(x):
    # Thanks to @jezrael for the improvement.
    return x.reindex(pd.date_range(x.index.min(), x.index.max()), fill_value=0)
Next, convert reading_date to datetime using pd.to_datetime:
df.reading_date = pd.to_datetime(df.reading_date)
Now, perform a groupby.
df = (
    df.set_index('reading_date')
      .groupby('area_id')
      .temp
      .apply(reindex)
      .reset_index()
)
df.columns = ['area_id', 'reading_date', 'temp']
df
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
