I have the following dataframe:
Date abc xyz
01-Jun-13 100 200
03-Jun-13 -20 50
15-Aug-13 40 -5
20-Jan-14 25 15
21-Feb-14 60 80
I need to group the data by year and month. I.e., Group by Jan 2013, Feb 2013, Mar 2013, etc...
I will be using the newly grouped data to create a plot showing abc vs xyz per year/month.
I've tried various combinations of groupby and sum, but I just can't seem to get anything to work. How can I do it?
You can use either resample or Grouper (which resamples under the hood).
First make sure that the datetime column is actually of datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:
In [11]: df1
Out[11]:
abc xyz
Date
2013-06-01 100 200
2013-06-03 -20 50
2013-08-15 40 -5
2014-01-20 25 15
2014-02-21 60 80
In [12]: g = df1.groupby(pd.Grouper(freq="M")) # DataFrameGroupBy (grouped by Month)
In [13]: g.sum()
Out[13]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
In [14]: df1.resample("M").sum() # the same
Out[14]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
Note: Previously pd.Grouper(freq="M") was written as pd.TimeGrouper("M"). The latter is now deprecated since 0.21.
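For anyone who wants to reproduce the session above starting from the raw strings, here is a minimal sketch. It assumes the dd-Mon-yy format from the question. The "MS" (month start) alias is my own choice, not part of the answer above: it sidesteps differences between pandas versions in how month-end aliases are spelled, so the group labels are the first day of each month rather than the last. Passing min_count=1 keeps empty months as NaN instead of 0.

```python
import pandas as pd

# Data from the question, with dates still stored as strings.
df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Parse the strings and move the dates into a DatetimeIndex.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%b-%y")
df1 = df.set_index("Date")

# Group by calendar month; "MS" labels each group with the month's first day.
# min_count=1 leaves months without any rows as NaN rather than 0.
monthly = df1.groupby(pd.Grouper(freq="MS")).sum(min_count=1)
print(monthly)
```

The empty months (July and September through December 2013) still appear in the result, which is the main difference from grouping on year and month values directly.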
I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure.). I'm including this for interest's sake.
If it's a column (it has to be a datetime64 column! as I say, hit it with to_datetime), you can use the PeriodIndex:
In [21]: df
Out[21]:
Date abc xyz
0 2013-06-01 100 200
1 2013-06-03 -20 50
2 2013-08-15 40 -5
3 2014-01-20 25 15
4 2014-02-21 60 80
In [22]: pd.DatetimeIndex(df.Date).to_period("M") # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M
In [23]: per = df.Date.dt.to_period("M") # new way to get the same
In [24]: g = df.groupby(per)
In [25]: g.sum() # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
abc xyz
2013-06 80 250
2013-08 40 -5
2014-01 25 15
2014-02 60 80
To get the desired result we have to reindex...
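One possible way to do that reindexing is sketched below. It is only an illustration: the names full_range and filled are mine, and it assumes the gaps should show up as NaN rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2013-06-01", "2013-06-03", "2013-08-15", "2014-01-20", "2014-02-21"]
    ),
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# Group on the monthly period of each date; only months with data appear.
per = df["Date"].dt.to_period("M")
g = df.groupby(per)[["abc", "xyz"]].sum()

# Build every month between the first and last period, then reindex onto it.
full_range = pd.period_range(g.index.min(), g.index.max(), freq="M")
filled = g.reindex(full_range)
print(filled)
```

Months without data come back as NaN; chaining .fillna(0) would turn them into zeros if that suits the plot better.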
Keep it simple:
GB = DF.groupby([(DF.index.year), (DF.index.month)]).sum()
giving you,
print(GB)
abc xyz
2013 6 80 250
8 40 -5
2014 1 25 15
2 60 80
and then you can plot like asked using,
GB.plot('abc', 'xyz', kind='scatter')
There are different ways to do that.
I created the data frame to showcase the different techniques to filter your data.
df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
'abc': [100, -20, 40, 25, 60], 'xyz': [200, 50,-5, 15, 80] })
I separated months/year/day and separated month-year as you explained.
def getMonth(s):
    return s.split("-")[1]
def getDay(s):
    return s.split("-")[0]
def getYear(s):
    return s.split("-")[2]
def getYearMonth(s):
    return s.split("-")[1] + "-" + s.split("-")[2]
I created new columns: year, month, day and 'YearMonth'. In your case you need only one of the two options: group by the two columns 'year' and 'month', or by the single column 'YearMonth'.
df['year'] = df['Date'].apply(lambda x: getYear(x))
df['month'] = df['Date'].apply(lambda x: getMonth(x))
df['day'] = df['Date'].apply(lambda x: getDay(x))
df['YearMonth'] = df['Date'].apply(lambda x: getYearMonth(x))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
3 20-Jan-14 25 15 14 Jan 20 Jan-14
4 21-Feb-14 60 80 14 Feb 21 Feb-14
You can go through the different groups in groupby(..) items.
In this case, we are grouping by two columns:
for key, g in df.groupby(['year', 'month'], sort=False):
    print(key, g)
Output:
('13', 'Jun') Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
('13', 'Aug') Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
('14', 'Jan') Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
('14', 'Feb') Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by one column:
for key, g in df.groupby('YearMonth', sort=False):
    print(key, g)
Output:
Jun-13 Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Aug-13 Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
Jan-14 Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
Feb-14 Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In case you want to access a specific group, you can use get_group:
print(df.groupby('YearMonth').get_group('Jun-13'))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Filtering with a boolean mask is an alternative to get_group and gives the same result:
print(df[df['YearMonth']=='Jun-13'])
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
You can select the abc or xyz values for Jun-13:
print(df[df['YearMonth']=='Jun-13'].abc.values)
print(df[df['YearMonth']=='Jun-13'].xyz.values)
Output:
[100 -20] #abc values
[200 50] #xyz values
You can use this to go through the dates that you have classified as "year-month" and apply criteria on it to get related data.
for x in set(df.YearMonth):
    print(df[df['YearMonth']==x].abc.values)
    print(df[df['YearMonth']==x].xyz.values)
I also recommend checking this answer.
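The same per-month lookup can also be written as a single pass over a groupby, which avoids re-filtering the frame for every key. This is only a sketch; the dictionary name values_by_month is made up for illustration, and sort=False is used so the months come out in order of first appearance.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01-Jun-13", "03-Jun-13", "15-Aug-13", "20-Jan-14", "21-Feb-14"],
    "abc": [100, -20, 40, 25, 60],
    "xyz": [200, 50, -5, 15, 80],
})

# "01-Jun-13" -> ["01", "Jun", "13"] -> ["Jun", "13"] -> "Jun-13"
df["YearMonth"] = df["Date"].str.split("-").str[1:].str.join("-")

# Collect the abc and xyz values of every month in one pass.
values_by_month = {}
for ym, g in df.groupby("YearMonth", sort=False):
    values_by_month[ym] = (g["abc"].tolist(), g["xyz"].tolist())

print(values_by_month["Jun-13"])
```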
You can also do it by creating a string column with the year and month as follows:
df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')
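However, this doesn't preserve chronological order when you loop over the groups, because the keys are strings and are sorted lexicographically (so "2008 10" comes before "2008 2"), e.g.
for name, group in grouped:
    print(name)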
Will give:
2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10
So then, if you want to preserve the order, you must do as suggested by @Q-man above:
grouped = df.groupby([df.index.year, df.index.month])
This will preserve the order in the above loop:
(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
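The ordering difference can be reproduced on a tiny frame. The sketch below uses made-up dates and also shows a third option that is not in the answers above: zero-padded string keys from strftime, which sort chronologically even though they are strings.

```python
import pandas as pd

# Four hypothetical dates, chosen so that month 10 and month 2 both occur.
idx = pd.to_datetime(["2007-11-15", "2008-01-10", "2008-02-05", "2008-10-20"])
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

# String keys without padding: sorted lexicographically.
labels = pd.Index([f"{y} {m}" for y, m in zip(df.index.year, df.index.month)])
string_order = [name for name, _ in df.groupby(labels)]

# Integer (year, month) keys: sorted numerically, i.e. chronologically.
numeric_order = [name for name, _ in df.groupby([df.index.year, df.index.month])]

# Zero-padded string keys: lexicographic order equals chronological order.
padded_order = [name for name, _ in df.groupby(df.index.strftime("%Y-%m"))]

print(string_order)
print(numeric_order)
print(padded_order)
```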
Some of the answers are using Date as an index instead of a column (and there's nothing wrong with doing that).
However, for anyone who has the dates stored as a column (instead of an index), remember to access the column's dt attribute. That is:
# First make sure `Date` is a datetime column
df['Date'] = pd.to_datetime(
    arg=df['Date'],
    format='%d-%b-%y' # Assuming dd-Mon-yy format
)
# Group by year and month
df.groupby(
    [
        df['Date'].dt.year,
        df['Date'].dt.month
    ]
).sum()
Related
I am currently working on python to add rows quarter by quarter.
The dataframe that I'm working with looks like below:
df = [['A','2021-03',1,9,17,25], ['A','2021-06',2,10,18,26], ['A','2021-09',3,11,19,27], ['A','2021-12',4,12,20,28],
['B','2021-03',5,13,21,29], ['B','2021-06',6,14,22,30], ['B','2022-03',7,15,23,31], ['B','2022-06',8,16,24,32]]
df_fin = pd.DataFrame(df, columns=['ID','Date','Value_1','Value_2','Value_3','Value_4'])
The dataframe has 'ID' and 'Date' columns and four value columns that are to be summed.
The 'Date' is in the form of 20XX-03, 20XX-06, 20XX-09, 20XX-12.
Within the same 'ID' value, I want to add the rows together to get biannual dates. In other words, I want to add March to June, and September to December.
The final df will look like below:
ID  Date     Value_1  Value_2  Value_3  Value_4
A   2021-06        3       19       35       51
A   2021-12        7       23       39       55
B   2021-06       11       27       43       59
B   2022-06       15       31       47       63
You can use groupby:
df_fin['temp'] = df_fin['Date'].replace({'-03': '-06', '-09':'-12'}, regex=True)
df_fin.groupby(['ID', 'temp']).sum().reset_index().rename(columns={'temp': 'Date'})
ID Date Value_1 Value_2 Value_3 Value_4
0 A 2021-06 3 19 35 51
1 A 2021-12 7 23 39 55
2 B 2021-06 11 27 43 59
3 B 2022-06 15 31 47 63
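If the dates are not guaranteed to be exactly the -03/-06/-09/-12 quarter ends, the half-year label can be derived from the month number instead of a string replacement. This is a sketch of that variant, not the method above; half_end and half_label are illustrative names, and it assumes every date parses with the %Y-%m format.

```python
import numpy as np
import pandas as pd

rows = [['A', '2021-03', 1, 9, 17, 25], ['A', '2021-06', 2, 10, 18, 26],
        ['A', '2021-09', 3, 11, 19, 27], ['A', '2021-12', 4, 12, 20, 28],
        ['B', '2021-03', 5, 13, 21, 29], ['B', '2021-06', 6, 14, 22, 30],
        ['B', '2022-03', 7, 15, 23, 31], ['B', '2022-06', 8, 16, 24, 32]]
df_fin = pd.DataFrame(
    rows, columns=['ID', 'Date', 'Value_1', 'Value_2', 'Value_3', 'Value_4'])

dates = pd.to_datetime(df_fin['Date'], format='%Y-%m')

# Months 1-6 belong to the half ending in June, months 7-12 to December.
half_end = pd.Series(np.where(dates.dt.month <= 6, '06', '12'),
                     index=df_fin.index)
half_label = dates.dt.year.astype(str) + '-' + half_end

# Overwrite Date with the half-year label, then sum within each ID and half.
out = (df_fin.assign(Date=half_label)
             .groupby(['ID', 'Date'], as_index=False)
             .sum())
print(out)
```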
I have a pandas dataframe as such:
id =[30,30,40,40,30,40,55,30]
month =[1,3,11,4,10,2,12,12]
average=[90,80,50,92,18,15,16,55]
sec =['id1','id1','id3','id4','id2','id2','id1','id1']
df = pd.DataFrame(list(zip(id,sec,month,average)),columns =['id','sec','month','Average'])
We want to add one more column holding the comma-separated months that meet the conditions below:
exclude sec id2,
and keep only rows with an average below 90.
Desired Output
I have tried below code but not getting desired output
final=pd.DataFrame()
for i in set(sec):
    if i !='id2': #Exclude id2
        d2 =df[df['sec']==i]
        d2=df[df['average']<90] # apply below 90 condition
        d2=d2[['id','month']].groupby(['id'], as_index=False).agg(lambda x: ', '.join(sorted(set(x.astype(str))))) #comma seperated data
        d2.rename(columns={'month':'problematic_month'},inplace=True)
        d2['sec']=i
        tab =df.merge(d2,on =['id','sec'], how ='inner')
        final =final.append(tab)
    else:
        d2 =df[df['sec']==i]
        d2['problematic_month']=np.NaN
        final =final.append(d2)
Kindly suggest any other way(without merge) to get the desired output
Another way using groupby+transform
import calendar
import numpy as np
d = dict(enumerate(calendar.month_abbr))
s = df['month'].map(d).where(df['sec'].ne("id2")& (df['Average'].lt(90)))
col = s.groupby([df["id"],df['sec']]).transform(lambda x: ','.join(x.dropna()))
out = df.assign(problematic_column=col.replace("",np.nan)).sort_values(['id','sec'])
print(out)
id sec month Average problematic_column
0 30 id1 1 90 Mar,Dec
1 30 id1 3 80 Mar,Dec
7 30 id1 12 55 Mar,Dec
4 30 id2 10 18 NaN
5 40 id2 2 15 NaN
2 40 id3 11 50 Nov
3 40 id4 4 92 NaN
6 55 id1 12 16 Dec
Steps:
Map the month column to the calendar to get the month abbreviation.
Retain values only when the condition matches.
Use groupby and transform to dropna and join by comma.
You can start by first converting your int months to actual Month abbreviations using calendar.
df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])
print(df.head(3))
id sec month Average
0 30 id1 Jan 90
1 30 id1 Mar 80
2 40 id3 Nov 50
Then I would use loc to narrow your dataframe based on your conditions above, and a groupby to join the months together per sec.
Thereafter use map to attach it to your initial dataframe:
r = df.loc[(df['Average'].gt(90) |\
(df['sec'].eq('id2'))).eq(0)]\
.groupby('sec').agg({'month':lambda x: ','.join(x)})\
.reset_index()\
.rename({'month':'problematic_month'},axis=1)
print(r)
sec problematic_month
0 id1 Jan,Mar,Dec
1 id3 Nov
# Attach with map
df['problematic_month'] = df['sec'].map(dict(zip(r.sec,r.problematic_month)))
>>> print(df)
id sec month Average problematic_month
0 30 id1 Jan 90 Jan,Mar,Dec
1 30 id1 Mar 80 Jan,Mar,Dec
2 40 id3 Nov 50 Nov
3 40 id4 Apr 92 NaN
4 30 id2 Oct 18 NaN
5 40 id2 Feb 15 NaN
6 55 id1 Dec 16 Jan,Mar,Dec
Then, using this problematic_month column, you can check whether it contains a comma, and if it does, select the first and last month:
import numpy as np
f = df['problematic_month'].str.split(',').str[0]
l = ',' + df['problematic_month'].str.split(',').str[-1]
df['problematic_month'] = np.where(df['problematic_month'].str.contains(','),f+l, df['problematic_month'])
Answer:
>>> print(df)
id sec month Average problematic_month
0 30 id1 Jan 90 Jan,Dec
1 30 id1 Mar 80 Jan,Dec
2 40 id3 Nov 50 Nov
3 40 id4 Apr 92 NaN
4 30 id2 Oct 18 NaN
5 40 id2 Feb 15 NaN
6 55 id1 Dec 16 Jan,Dec
I have a pandas Dataframe that looks like this:
year month name value1 value2
0 2021 7 cars 5000 4000
1 2021 7 boats 2000 250
2 2021 9 cars 3000 7000
And I want it to look like this:
year month day name value1 value2
0 2021 7 1 cars 161.29 129.03
1 2021 7 2 cars 161.29 129.03
2 2021 7 3 cars 161.29 129.03
3 2021 7 4 cars 161.29 129.03
...
31 2021 7 1 boats 64.51 8.064
32 2021 7 2 boats 64.51 8.064
33 2021 7 3 boats 64.51 8.064
...
62 2021 9 1 cars 100 233.33
63 2021 9 2 cars 100 233.33
64 2021 9 3 cars 100 233.33
The idea is that I want to divide the value columns by the number of days in the month and create a day column, so that in the end I can build a date column by combining year, month and day.
Can anyone help me?
One option would be to use monthrange from calendar to get the number of days in a given month, divide the value by days in the month, then use Index.repeat to scale up the DataFrame and groupby cumcount to add in the Days:
from calendar import monthrange
import pandas as pd
df = pd.DataFrame(
{'year': {0: 2021, 1: 2021, 2: 2021}, 'month': {0: 7, 1: 7, 2: 9},
'name': {0: 'cars', 1: 'boats', 2: 'cars'},
'value1': {0: 5000, 1: 2000, 2: 3000},
'value2': {0: 4000, 1: 250, 2: 7000}})
days_in_month = (
df[['year', 'month']].apply(lambda x: monthrange(*x)[1], axis=1)
)
# Calculate new values
df.loc[:, 'value1':] = df.loc[:, 'value1':].div(days_in_month, axis=0)
df = df.loc[df.index.repeat(days_in_month)] # Scale Up DataFrame
df.insert(2, 'day', df.groupby(level=0).cumcount() + 1) # Add Days Column
df = df.reset_index(drop=True) # Clean up Index
df:
year month day name value1 value2
0 2021 7 1 cars 161.290323 129.032258
1 2021 7 2 cars 161.290323 129.032258
2 2021 7 3 cars 161.290323 129.032258
3 2021 7 4 cars 161.290323 129.032258
4 2021 7 5 cars 161.290323 129.032258
.. ... ... ... ... ... ...
87 2021 9 26 cars 100.000000 233.333333
88 2021 9 27 cars 100.000000 233.333333
89 2021 9 28 cars 100.000000 233.333333
90 2021 9 29 cars 100.000000 233.333333
91 2021 9 30 cars 100.000000 233.333333
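The question also mentions building a date column from year, month and day at the end. A minimal sketch of that last step is below, on a hand-made three-row frame rather than the full result: pd.to_datetime accepts a DataFrame whose columns are named year, month and day.

```python
import pandas as pd

# A few hypothetical rows in the shape of the expanded result.
daily = pd.DataFrame({
    "year": [2021, 2021, 2021],
    "month": [7, 7, 9],
    "day": [1, 2, 30],
    "name": ["cars", "cars", "cars"],
    "value1": [161.290323, 161.290323, 100.0],
})

# Assemble a datetime column from the three component columns.
daily["date"] = pd.to_datetime(daily[["year", "month", "day"]])
print(daily)
```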
For that you need to create a temporary dataframe that lists the days in each month, merge it with your data, and then divide the values.
Let's assume that your data covers a single year, so we can build the date range from it straight away and create the temporary dataframe:
dt_range = pd.DataFrame(pd.date_range(str(df.loc[0,'year']) + '-01-01', periods=365))
dt_range.columns = ['dte']
dt_range['year'] = dt_range['dte'].dt.year
dt_range['month'] = dt_range['dte'].dt.month
dt_range['day'] = dt_range['dte'].dt.day
now we can create the new dataframe:
new_df = pd.merge(df, dt_range,how='left',on=['year','month'])
now all we have to do is divide each value by the number of days in its month, and we have what you needed
new_df['value'] = new_df['value'] / new_df['dte'].dt.days_in_month
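A self-contained sketch of this merge-based approach on the question's data follows. It rests on a few assumptions of mine: all rows fall in 2021, the value columns are value1 and value2 as in the question, and the helper frame is called days.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2021, 2021, 2021],
    "month": [7, 7, 9],
    "name": ["cars", "boats", "cars"],
    "value1": [5000, 2000, 3000],
    "value2": [4000, 250, 7000],
})

# Helper frame with one row per calendar day of the (assumed) single year.
days = pd.DataFrame({"dte": pd.date_range("2021-01-01", "2021-12-31", freq="D")})
days["year"] = days["dte"].dt.year
days["month"] = days["dte"].dt.month
days["day"] = days["dte"].dt.day

# The merge repeats every original row once per day of its month.
daily = df.merge(days, on=["year", "month"], how="left")

# Spread each monthly total evenly over the days of that month.
n_days = daily["dte"].dt.days_in_month
for col in ["value1", "value2"]:
    daily[col] = daily[col] / n_days

print(daily.head())
```

Because the division uses the day count of each row's own date, two names sharing a month (cars and boats in July) do not interfere with each other.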
You can use resample to upsample months into days:
import pandas as pd
df = pd.DataFrame([[2021,7,5000]], columns=['year', 'month', 'value'])
# create datetime column as period
df['datetime'] = pd.to_datetime(df['month'].astype(str) + '/' + df['year'].astype(str)).dt.to_period("M")
# calculate values per day by dividing the value by number of days per month
df['ndays'] = df['datetime'].apply(lambda x: x.days_in_month)
df['value'] = df['value'] / df['ndays']
# set datetime as index and resample:
df = df[['value', 'datetime']].set_index('datetime')
df = df.resample('d').ffill().reset_index()
#split datetime to separate columns
df['day'] = df['datetime'].dt.day
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year
df.drop(columns=['datetime'], inplace=True)
    value  day  month  year
0  161.29    1      7  2021
1  161.29    2      7  2021
2  161.29    3      7  2021
3  161.29    4      7  2021
4  161.29    5      7  2021
I assume the dataframe can have more months, so here is your initial dataframe extended a little:
from io import StringIO
df = pd.read_csv(StringIO("""
year month value
2021 7 5000
2021 8 5000
2021 9 5000
"""), sep = r"\s+")
Which gives dataframe df:
year month value
0 2021 7 5000
1 2021 8 5000
2 2021 9 5000
The solution is a single chained expression: first a datetime index is created from the raw year and month, then the resample method is used to convert months to days, and finally value is overwritten with the per-day average for each month:
df_out = (
df.set_index(pd.DatetimeIndex(pd.to_datetime(dict(year=df.year, month=df.month, day=1)), freq="MS"))
.resample('D')
.ffill()
.assign(value = lambda df: df.value/df.index.days_in_month)
)
Resulting dataframe:
year month value
2021-07-01 2021 7 161.290323
2021-07-02 2021 7 161.290323
2021-07-03 2021 7 161.290323
2021-07-04 2021 7 161.290323
2021-07-05 2021 7 161.290323
... ... ...
2021-08-28 2021 8 161.290323
2021-08-29 2021 8 161.290323
2021-08-30 2021 8 161.290323
2021-08-31 2021 8 161.290323
2021-09-01 2021 9 166.666667
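Please note that September has only 30 days, so its value differs from the previous months. Also note that the resampled index stops at 2021-09-01, the last timestamp in the original index, so only the first day of the final month appears.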
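If every day of the last month is needed as well, one alternative is to generate the days of each month explicitly and explode them into rows. This is a sketch of a different technique from the resample above, and the column name date is my own choice.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2021, 2021, 2021],
    "month": [7, 8, 9],
    "value": [5000, 5000, 5000],
})

# First day of each month, assembled from the year and month columns.
month_start = pd.to_datetime(df[["year", "month"]].assign(day=1))

# One list of daily timestamps per row, covering the whole month.
df["date"] = [
    list(pd.date_range(start, periods=start.days_in_month, freq="D"))
    for start in month_start
]

# One output row per day; explode leaves an object column, so convert it back.
daily = df.explode("date", ignore_index=True)
daily["date"] = pd.to_datetime(daily["date"])

# Per-day average within each month.
daily["value"] = daily["value"] / daily["date"].dt.days_in_month
print(daily.tail())
```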
I have a csv file in the format:
20 05 2019 12:00:00, 100
21 05 2019 12:00:00, 200
22 05 2019 12:00:00, 480
I want to access the second variable. I've tried a variety of different alterations, but none have worked.
Initially I tried
import pandas as pd
import numpy as np
col = [i for i in range(2)]
col[1] = "Power"
data = pd.read_csv('FILENAME.csv', names=col)
df1 = data.sum(data, axis=1)
df2 = np.cumsum(df1)
print(df2)
You can use cumsum function:
data['Power'].cumsum()
Output:
0 100
1 300
2 780
Name: Power, dtype: int64
Use df.cumsum:
In [1820]: df = pd.read_csv('FILENAME.csv', names=col)
In [1821]: df
Out[1821]:
0 Power
0 20 05 2019 12:00:00 100
1 21 05 2019 12:00:00 200
2 22 05 2019 12:00:00 480
In [1823]: df['cumulative sum'] = df['Power'].cumsum()
In [1824]: df
Out[1824]:
0 Power cumulative sum
0 20 05 2019 12:00:00 100 100
1 21 05 2019 12:00:00 200 300
2 22 05 2019 12:00:00 480 780
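Here is a self-contained sketch of the same idea that also parses the first column into real timestamps. The file is replaced by an in-memory string so it runs as-is, and the column names Timestamp and Cumulative are my own; skipinitialspace=True drops the blank after each comma.

```python
from io import StringIO

import pandas as pd

# Stand-in for FILENAME.csv, using the three rows from the question.
raw = StringIO(
    "20 05 2019 12:00:00, 100\n"
    "21 05 2019 12:00:00, 200\n"
    "22 05 2019 12:00:00, 480\n"
)

data = pd.read_csv(raw, names=["Timestamp", "Power"], skipinitialspace=True)

# Parse the "day month year time" strings into datetimes.
data["Timestamp"] = pd.to_datetime(data["Timestamp"], format="%d %m %Y %H:%M:%S")

# Running total of the second column.
data["Cumulative"] = data["Power"].cumsum()
print(data)
```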
I have a pandas column like this :
yrmnt
--------
2015 03
2015 03
2013 08
2015 08
2014 09
2015 10
2016 02
2015 11
2015 11
2015 11
2017 02
How can I fetch the lowest year-month combination (2013 08) and the highest (2017 02),
and find the difference in months between the two, i.e. 42?
You can convert the column with to_datetime and then find the indices of the max and min values with idxmax and idxmin:
a = pd.to_datetime(df['yrmnt'], format='%Y %m')
print (a)
0 2015-03-01
1 2015-03-01
2 2013-08-01
3 2015-08-01
4 2014-09-01
5 2015-10-01
6 2016-02-01
7 2015-11-01
8 2015-11-01
9 2015-11-01
10 2017-02-01
Name: yrmnt, dtype: datetime64[ns]
print (df.loc[a.idxmax(), 'yrmnt'])
2017 02
print (df.loc[a.idxmin(), 'yrmnt'])
2013 08
Difference in months:
b = a.dt.to_period('M')
d = b.max() - b.min()
print (d)
42
Another solution working only with month period created by Series.dt.to_period:
b = pd.to_datetime(df['yrmnt'], format='%Y %m').dt.to_period('M')
print (b)
0 2015-03
1 2015-03
2 2013-08
3 2015-08
4 2014-09
5 2015-10
6 2016-02
7 2015-11
8 2015-11
9 2015-11
10 2017-02
Name: yrmnt, dtype: object
Then convert to custom format by Period.strftime minimal and maximal values:
min_d = b.min().strftime('%Y %m')
print (min_d)
2013 08
max_d = b.max().strftime('%Y %m')
print (max_d)
2017 02
And subtract for difference:
d = b.max() - b.min()
print (d)
42
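As far as I know, recent pandas versions return an offset object rather than a plain integer when two Periods are subtracted, so the printed form may differ from the output above. A version-independent sketch is to compute the month count from the year and month attributes directly; lo, hi and months are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"yrmnt": [
    "2015 03", "2015 03", "2013 08", "2015 08", "2014 09", "2015 10",
    "2016 02", "2015 11", "2015 11", "2015 11", "2017 02",
]})

# Monthly periods parsed from the "YYYY MM" strings.
b = pd.to_datetime(df["yrmnt"], format="%Y %m").dt.to_period("M")
lo, hi = b.min(), b.max()

# Whole years contribute 12 months each, plus the (possibly negative) month gap.
months = (hi.year - lo.year) * 12 + (hi.month - lo.month)
print(lo, hi, months)
```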