I am currently working in Python, trying to add rows together quarter by quarter.
The dataframe I'm working with looks like this:
import pandas as pd

df = [['A', '2021-03', 1, 9, 17, 25], ['A', '2021-06', 2, 10, 18, 26], ['A', '2021-09', 3, 11, 19, 27], ['A', '2021-12', 4, 12, 20, 28],
      ['B', '2021-03', 5, 13, 21, 29], ['B', '2021-06', 6, 14, 22, 30], ['B', '2022-03', 7, 15, 23, 31], ['B', '2022-06', 8, 16, 24, 32]]
df_fin = pd.DataFrame(df, columns=['ID', 'Date', 'Value_1', 'Value_2', 'Value_3', 'Value_4'])
The dataframe has an 'ID' column, a 'Date' column, and four value columns that are to be summed.
The 'Date' is in the form 20XX-03, 20XX-06, 20XX-09, 20XX-12.
Within each 'ID', I want to add the rows together to get biannual dates. In other words, I want to add March to June, and September to December.
The final df should look like this:
ID  Date     Value_1  Value_2  Value_3  Value_4
A   2021-06        3       19       35       51
A   2021-12        7       23       39       55
B   2021-06       11       27       43       59
B   2022-06       15       31       47       63
You can use groupby:
df_fin['temp'] = df_fin['Date'].replace({'-03': '-06', '-09': '-12'}, regex=True)
(df_fin.drop(columns='Date')  # drop the original Date so only the value columns are summed
       .groupby(['ID', 'temp']).sum()
       .reset_index()
       .rename(columns={'temp': 'Date'}))
ID Date Value_1 Value_2 Value_3 Value_4
0 A 2021-06 3 19 35 51
1 A 2021-12 7 23 39 55
2 B 2021-06 11 27 43 59
3  B  2022-06       15       31       47       63
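As a variation, if you'd rather not rely on string suffixes, you can derive the half-year label from a real PeriodIndex (a sketch; it assumes every Date parses as a quarterly period):
# Parse each YYYY-MM date as the quarter containing it, then label
# Q1/Q2 as the June half and Q3/Q4 as the December half.
q = pd.PeriodIndex(df_fin['Date'], freq='Q')
df_fin['temp'] = [f"{p.year}-06" if p.quarter <= 2 else f"{p.year}-12" for p in q]
# ...then group and sum exactly as above.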
I have 2 dataframes:
df1:
artist_id  concert_date  region_id
12345      2019-10       22
33322      2018-11       44
df2:
artist_id  date     region_id  popularity
12345      2019-10  22         76
12345      2019-11  44         23
I need to add to the first table the median of the artist's popularity from the second table, calculated only over the last 3 months before the concert and only for the same region.
That is, the first table should end up looking like this (the figures are invented; the point is the structure, not the exact values):
df1:
artist_id  concert_date  region_id  popularity_median_last3month
12345      2019-10       22         55
33322      2018-11       44         44
Right now I'm using the following loop:
df1['popularity_median_last3month'] = pd.Series(dtype='int')
for i in range(len(df1)):
    df1['popularity_median_last3month'].values[i] = df2[
        (df2.artist_id == df1.artist_id.values[i])
        & (df2.region_id == df1.region_id.values[i])
        & (df2.date <= df1.concert_date.values[i])
    ][-3:].popularity.median()
However, it takes too long with a large amount of data.
Please tell me how to avoid the loop.
Here's a way to do this without a Python loop:
df3 = df1.merge(df2, on=['artist_id', 'region_id'])
df3 = df3[df3.date >= df3.concert_date - pd.DateOffset(months=3)]
df3 = (df3.groupby(['artist_id', 'region_id', 'concert_date'])[['popularity']]
          .median()  # select popularity first so only it is aggregated
          .rename(columns={'popularity': 'popularity_median_last3month'}))
df1 = df1.join(df3, on=['artist_id', 'region_id', 'concert_date'])
Input:
df1
artist_id concert_date region_id
0 12345 2019-10-01 22
1 33322 2018-11-01 44
2 12345 2019-12-01 22
df2
artist_id date region_id popularity
0 12345 2019-10-01 22 76
1 12345 2019-11-01 44 23
2 12345 2019-11-01 22 50
3 12345 2019-08-01 22 68
Output:
artist_id concert_date region_id popularity_median_last3month
0 12345 2019-10-01 22 68.0
1 33322 2018-11-01 44 NaN
2 12345 2019-12-01 22 63.0
Explanation:
Use merge() to pair every row of df1 with the popularity rows of df2 for the same (artist_id, region_id).
Filter this so that only rows remain where the date of the popularity data point falls no more than 3 months before concert_date.
Use groupby() to get the median popularity for each (artist_id, region_id, concert_date) group, and rename that column to popularity_median_last3month.
Use join() to add the popularity_median_last3month column to df1.
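One caveat: the filter above only bounds the window from below, so popularity rows dated after the concert itself still contribute. If you read "last 3 months before the concert" strictly, add an upper bound as well; a sketch (note it changes the medians shown above):
df3 = df3[(df3.date >= df3.concert_date - pd.DateOffset(months=3))
          & (df3.date <= df3.concert_date)]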
I have a pandas dataframe as such:
id = [30, 30, 40, 40, 30, 40, 55, 30]
month = [1, 3, 11, 4, 10, 2, 12, 12]
average = [90, 80, 50, 92, 18, 15, 16, 55]
sec = ['id1', 'id1', 'id3', 'id4', 'id2', 'id2', 'id1', 'id1']
df = pd.DataFrame(list(zip(id, sec, month, average)), columns=['id', 'sec', 'month', 'Average'])
We want to add one more column containing the comma-separated months for rows that meet the conditions below:
we need to exclude sec id2,
and keep only rows whose Average is below 90.
Desired output:
I have tried the code below but am not getting the desired output:
final = pd.DataFrame()
for i in set(sec):
    if i != 'id2':  # exclude id2
        d2 = df[df['sec'] == i]
        d2 = df[df['Average'] < 90]  # apply the below-90 condition
        d2 = d2[['id', 'month']].groupby(['id'], as_index=False).agg(
            lambda x: ', '.join(sorted(set(x.astype(str)))))  # comma-separated data
        d2.rename(columns={'month': 'problematic_month'}, inplace=True)
        d2['sec'] = i
        tab = df.merge(d2, on=['id', 'sec'], how='inner')
        final = final.append(tab)
    else:
        d2 = df[df['sec'] == i]
        d2['problematic_month'] = np.NaN
        final = final.append(d2)
Kindly suggest any other way (without merge) to get the desired output.
Another way, using groupby + transform:
import calendar
import numpy as np

d = dict(enumerate(calendar.month_abbr))
s = df['month'].map(d).where(df['sec'].ne('id2') & df['Average'].lt(90))
col = s.groupby([df['id'], df['sec']]).transform(lambda x: ','.join(x.dropna()))
out = df.assign(problematic_column=col.replace('', np.nan)).sort_values(['id', 'sec'])
print(out)
id sec month Average problematic_column
0 30 id1 1 90 Mar,Dec
1 30 id1 3 80 Mar,Dec
7 30 id1 12 55 Mar,Dec
4 30 id2 10 18 NaN
5 40 id2 2 15 NaN
2 40 id3 11 50 Nov
3 40 id4 4 92 NaN
6 55 id1 12 16 Dec
Steps:
Map the month column through the calendar module to get the month abbreviations.
Retain values only where the conditions match.
Use groupby and transform to drop the NaNs and join the remaining months with commas.
You can start by converting your integer months to actual month abbreviations using calendar:
import calendar

df['month'] = df['month'].apply(lambda x: calendar.month_abbr[x])
print(df.head(3))
id sec month Average
0 30 id1 Jan 90
1 30 id1 Mar 80
2 40 id3 Nov 50
Then I would use loc to narrow your dataframe based on the conditions above, and a groupby to collect the months per sec.
Thereafter, use map to attach the result to your initial dataframe:
r = (df.loc[(df['Average'].gt(90) | df['sec'].eq('id2')).eq(0)]
       .groupby('sec').agg({'month': lambda x: ','.join(x)})
       .reset_index()
       .rename({'month': 'problematic_month'}, axis=1))
print(r)
sec problematic_month
0 id1 Jan,Mar,Dec
1 id3 Nov
# Attach with map
df['problematic_month'] = df['sec'].map(dict(zip(r.sec,r.problematic_month)))
>>> print(df)
id sec month Average problematic_month
0 30 id1 Jan 90 Jan,Mar,Dec
1 30 id1 Mar 80 Jan,Mar,Dec
2 40 id3 Nov 50 Nov
3 40 id4 Apr 92 NaN
4 30 id2 Oct 18 NaN
5 40 id2 Feb 15 NaN
6 55 id1 Dec 16 Jan,Mar,Dec
Then, using this problematic_month column, you can check whether it contains a comma, and if it does, keep only the first and last month:
import numpy as np

f = df['problematic_month'].str.split(',').str[0]
l = ',' + df['problematic_month'].str.split(',').str[-1]
df['problematic_month'] = np.where(df['problematic_month'].str.contains(','), f + l, df['problematic_month'])
Answer:
>>> print(df)
id sec month Average problematic_month
0 30 id1 Jan 90 Jan,Dec
1 30 id1 Mar 80 Jan,Dec
2 40 id3 Nov 50 Nov
3 40 id4 Apr 92 NaN
4 30 id2 Oct 18 NaN
5 40 id2 Feb 15 NaN
6 55 id1 Dec 16 Jan,Dec
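As a side note, the same first-and-last trimming can be done with a single regex replace (a sketch, assuming the months stay comma-separated as above):
# Collapse 'first,...,last' to 'first,last'; values without at least
# two commas (single months, NaN) are left unchanged.
df['problematic_month'] = df['problematic_month'].str.replace(
    r'^([^,]+),.*,([^,]+)$', r'\1,\2', regex=True)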
I have a shipping records table with approx. 100K rows, and I want to calculate, for each row and each material, how many qtys were shipped in the last 30 days.
As you can see in the example below, the calculated qty depends on (material, shipping date).
I've tried to write very basic code and couldn't find a way to apply it to all rows:
df[(df['malzeme'] == material)
   & (df['cikistarihi'] < shippingDate)
   & (df['cikistarihi'] >= shippingDate - pd.Timedelta(days=30))]['qty'].sum()
material  shippingDate  qty  shipped qtys in last 30 days
A         23.01.2019      8  0
A         28.01.2019     41  8
A         31.01.2019     66  49 (8+41)
A         20.03.2019     67  0
B         17.02.2019     53  0
B         26.02.2019     35  53
B         11.03.2019      4  88 (53+35)
B         20.03.2019     67  39 (35+4)
You can use .groupby with .rolling:
# convert the shippingData to datetime:
df["shippingDate"] = pd.to_datetime(df["shippingDate"], dayfirst=True)
# sort the values (if they aren't already)
df = df.sort_values(["material", "shippingDate"])
df["shipped qtys in last 30 days"] = (
df.groupby("material")
.rolling("30D", on="shippingDate", closed="left")["qty"]
.sum()
.fillna(0)
.values
)
print(df)
Prints:
material shippingDate qty shipped qtys in last 30 days
0 A 2019-01-23 8 0.0
1 A 2019-01-28 41 8.0
2 A 2019-01-31 66 49.0
3 A 2019-03-20 67 0.0
4 B 2019-02-17 53 0.0
5 B 2019-02-26 35 53.0
6 B 2019-03-11 4 88.0
7 B 2019-03-20 67 39.0
EDIT: Add .sort_values() before groupby
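For completeness, here's a minimal reconstruction of the question's input so the snippet above runs end to end (column names taken from the table; just a sketch):
import pandas as pd

df = pd.DataFrame({
    "material": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "shippingDate": ["23.01.2019", "28.01.2019", "31.01.2019", "20.03.2019",
                     "17.02.2019", "26.02.2019", "11.03.2019", "20.03.2019"],
    "qty": [8, 41, 66, 67, 53, 35, 4, 67],
})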
Data frame 1:
Stationid
10
11
12
13
14
15
16
17
Data frame 2:
Stationid Maintanance
10 55
15 38
21 100
10 56
22 101
15 39
10 56
I need to calculate the mean for the station ids of dataframe 1 using dataframe 2.
Expected output:
Stationid  Maintainance Mean
10         55.666667
15         38.500000
First filter with isin and boolean indexing, then aggregate the mean:
df = df2[df2['Stationid'].isin(df1['Stationid'])].groupby('Stationid', as_index=False)['Maintanance'].mean()
df.columns = ['Stationid', 'Maintainance Mean']
print (df)
Stationid Maintainance Mean
0 10 55.666667
1 15 38.500000
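To try this out, the two frames from the question can be rebuilt like so (a sketch; the Maintanance spelling is kept exactly as in the question, since the code references it):
import pandas as pd

df1 = pd.DataFrame({'Stationid': [10, 11, 12, 13, 14, 15, 16, 17]})
df2 = pd.DataFrame({'Stationid': [10, 15, 21, 10, 22, 15, 10],
                    'Maintanance': [55, 38, 100, 56, 101, 39, 56]})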
I have the following dataframe:
Date abc xyz
01-Jun-13 100 200
03-Jun-13 -20 50
15-Aug-13 40 -5
20-Jan-14 25 15
21-Feb-14 60 80
I need to group the data by year and month. I.e., Group by Jan 2013, Feb 2013, Mar 2013, etc...
I will be using the newly grouped data to create a plot showing abc vs xyz per year/month.
I've tried various combinations of groupby and sum, but I just can't seem to get anything to work. How can I do it?
You can use either resample or Grouper (which resamples under the hood).
First make sure that the datetime column is actually of datetimes (hit it with pd.to_datetime). It's easier if it's a DatetimeIndex:
In [11]: df1
Out[11]:
abc xyz
Date
2013-06-01 100 200
2013-06-03 -20 50
2013-08-15 40 -5
2014-01-20 25 15
2014-02-21 60 80
In [12]: g = df1.groupby(pd.Grouper(freq="M")) # DataFrameGroupBy (grouped by Month)
In [13]: g.sum()
Out[13]:
abc xyz
Date
2013-06-30 80 250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
In [14]: df1.resample("M").sum()  # the same
Out[14]:
abc xyz
Date
2013-06-30   80  250
2013-07-31 NaN NaN
2013-08-31 40 -5
2013-09-30 NaN NaN
2013-10-31 NaN NaN
2013-11-30 NaN NaN
2013-12-31 NaN NaN
2014-01-31 25 15
2014-02-28 60 80
Note: pd.Grouper(freq="M") was previously written as pd.TimeGrouper("M"); the latter has been deprecated since 0.21.
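A further note for recent versions: since pandas 2.2 the month-end alias "M" is itself deprecated in favor of "ME", so the equivalents become:
g = df1.groupby(pd.Grouper(freq="ME"))  # pandas >= 2.2
df1.resample("ME").sum()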
I had thought the following would work, but it doesn't (due to as_index not being respected? I'm not sure.). I'm including this for interest's sake.
If it's a column (it has to be a datetime64 column! as I say, hit it with to_datetime), you can use the PeriodIndex:
In [21]: df
Out[21]:
Date abc xyz
0 2013-06-01 100 200
1 2013-06-03 -20 50
2 2013-08-15 40 -5
3 2014-01-20 25 15
4 2014-02-21 60 80
In [22]: pd.DatetimeIndex(df.Date).to_period("M") # old way
Out[22]:
<class 'pandas.tseries.period.PeriodIndex'>
[2013-06, ..., 2014-02]
Length: 5, Freq: M
In [23]: per = df.Date.dt.to_period("M") # new way to get the same
In [24]: g = df.groupby(per)
In [25]: g.sum() # dang not quite what we want (doesn't fill in the gaps)
Out[25]:
abc xyz
2013-06 80 250
2013-08 40 -5
2014-01 25 15
2014-02 60 80
To get the desired result we have to reindex...
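A sketch of that reindex step, reusing per and g from the session above:
full = pd.period_range(per.min(), per.max(), freq="M")
g.sum().reindex(full)  # the missing months (2013-07 .. 2013-12) become NaN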
Keep it simple:
GB = DF.groupby([(DF.index.year), (DF.index.month)]).sum()
giving you,
print(GB)
abc xyz
2013 6 80 250
8 40 -5
2014 1 25 15
2 60 80
and then you can plot as asked using:
GB.plot('abc', 'xyz', kind='scatter')
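If you'd rather see the totals per period on an ordered axis instead of a scatter, a small sketch that flattens the MultiIndex first:
# Turn the (year, month) index into 'YYYY-MM' labels for a bar plot
GB.index = [f"{y}-{m:02d}" for y, m in GB.index]
GB.plot(y=['abc', 'xyz'], kind='bar')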
There are different ways to do that. I created the data frame below to showcase the different techniques for filtering your data:
df = pd.DataFrame({'Date': ['01-Jun-13', '03-Jun-13', '15-Aug-13', '20-Jan-14', '21-Feb-14'],
                   'abc': [100, -20, 40, 25, 60],
                   'xyz': [200, 50, -5, 15, 80]})
I separated the month, year, and day, and also the month-year combination, as you described:
def getMonth(s):
    return s.split("-")[1]

def getDay(s):
    return s.split("-")[0]

def getYear(s):
    return s.split("-")[2]

def getYearMonth(s):
    return s.split("-")[1] + "-" + s.split("-")[2]
I created new columns: year, month, day, and YearMonth. In your case you only need one of the two options: group using the two columns 'year' and 'month', or group using the single column YearMonth.
df['year'] = df['Date'].apply(getYear)
df['month'] = df['Date'].apply(getMonth)
df['day'] = df['Date'].apply(getDay)
df['YearMonth'] = df['Date'].apply(getYearMonth)
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
3 20-Jan-14 25 15 14 Jan 20 Jan-14
4 21-Feb-14 60 80 14 Feb 21 Feb-14
You can iterate over the different groups yielded by groupby(..).
In this case, we are grouping by two columns:
for key, g in df.groupby(['year', 'month']):
    print(key, g)
Output:
('13', 'Jun') Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
('13', 'Aug') Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
('14', 'Jan') Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
('14', 'Feb') Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In this case, we are grouping by one column:
for key, g in df.groupby('YearMonth'):
    print(key, g)
Output:
Jun-13 Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Aug-13 Date abc xyz year month day YearMonth
2 15-Aug-13 40 -5 13 Aug 15 Aug-13
Jan-14 Date abc xyz year month day YearMonth
3 20-Jan-14 25 15 14 Jan 20 Jan-14
Feb-14 Date abc xyz year month day YearMonth
4 21-Feb-14 60 80 14 Feb 21 Feb-14
In case you want to access a specific item, you can use get_group:
print(df.groupby('YearMonth').get_group('Jun-13'))
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
Similar to get_group, this filtering hack gives the same result:
print(df[df['YearMonth'] == 'Jun-13'])
Output:
Date abc xyz year month day YearMonth
0 01-Jun-13 100 200 13 Jun 01 Jun-13
1 03-Jun-13 -20 50 13 Jun 03 Jun-13
You can select the lists of abc or xyz values for Jun-13:
print(df[df['YearMonth'] == 'Jun-13'].abc.values)
print(df[df['YearMonth'] == 'Jun-13'].xyz.values)
Output:
[100 -20] #abc values
[200 50] #xyz values
You can use this to go through the dates you have classified as "year-month" and apply criteria to them to get the related data:
for x in set(df.YearMonth):
    print(df[df['YearMonth'] == x].abc.values)
    print(df[df['YearMonth'] == x].xyz.values)
I also recommend checking this answer.
You can also do it by creating a string column with the year and month as follows:
df['date'] = df.index
df['year-month'] = df['date'].apply(lambda x: str(x.year) + ' ' + str(x.month))
grouped = df.groupby('year-month')
However this doesn't preserve the order when you loop over the groups, e.g.
for name, group in grouped:
print(name)
Will give:
2007 11
2007 12
2008 1
2008 10
2008 11
2008 12
2008 2
2008 3
2008 4
2008 5
2008 6
2008 7
2008 8
2008 9
2009 1
2009 10
So then, if you want to preserve the order, you must do as suggested by @Q-man above:
grouped = df.groupby([df.index.year, df.index.month])
This will preserve the order in the above loop:
(2007, 11)
(2007, 12)
(2008, 1)
(2008, 2)
(2008, 3)
(2008, 4)
(2008, 5)
(2008, 6)
(2008, 7)
(2008, 8)
(2008, 9)
(2008, 10)
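Alternatively, zero-padding the month keeps the string column itself sortable (a sketch):
# Lexicographic order now matches chronological order
df['year-month'] = df['date'].apply(lambda x: f"{x.year}-{x.month:02d}")
grouped = df.groupby('year-month')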
Some of the answers are using Date as an index instead of a column (and there's nothing wrong with doing that).
However, for anyone who has the dates stored as a column (instead of an index), remember to access the column's dt attribute. That is:
# First make sure `Date` is a datetime column
df['Date'] = pd.to_datetime(
    arg=df['Date'],
    format='%d-%b-%y'  # Assuming dd-Mon-yy format
)

# Group by year and month, selecting the value columns
# so the datetime column itself isn't summed
df.groupby(
    [
        df['Date'].dt.year,
        df['Date'].dt.month
    ]
)[['abc', 'xyz']].sum()