I have a dataframe containing two columns of dates: start date and end date. I need to set up a dataframe where all months of the year are set up in separate columns based on the start and end date intervals so I can sum values from another column for each of the months per name.
To illustrate:
Original df:
Start Date End Date Name Value
10/22/20 01/25/21 John 100
10/12/20 04/30/21 John 50
02/25/21 None John 20
Desired df:
Name Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 Jul_21 Aug_21 ...
John 150 150 150 150 70 70 70 20 20 20 20 ...
Any suggestions or pointers on how I could achieve that result would be greatly appreciated!
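First convert both columns to datetimes, coercing non-dates such as None in End Date to missing values and replacing those with a cut-off date. Then build, in a list comprehension, a Series holding every month-end that falls inside each row's interval, and use it for pivoting with DataFrame.pivot_table:
import pandas as pd

# Open-ended rows (End Date "None") are capped at this date.
end = pd.Timestamp('2021-12-31')
df['Start'] = pd.to_datetime(df['Start Date'])
df['End'] = pd.to_datetime(df['End Date'], errors='coerce').fillna(end)
# One entry per month-end inside each interval; the value is the source row's index label.
# 'M' is the month-end alias; pandas 2.2 and later spell it 'ME'.
s = pd.concat([pd.Series(r.Index, pd.date_range(r.Start, r.End, freq='M'))
               for r in df.itertuples()])
# Attach the original columns to every (row, month-end) pair, then sum per Name and month.
df1 = pd.DataFrame({'Date': s.index}, s).join(df)
df2 = df1.pivot_table(index='Name',
                      columns='Date',
                      values='Value',
                      aggfunc='sum',
                      fill_value=0)
df2.columns = df2.columns.strftime('%b_%y')
print(df2)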
Date Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 \
Name
John 150 150 150 50 70 70 70 20 20
Date Jul_21 Aug_21 Sep_21 Oct_21 Nov_21 Dec_21
Name
John 20 20 20 20 20 20
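Note that Jan_21 comes out as 50 rather than the 150 in the desired output: a month-end date_range only yields 2021-01-31 for intervals that reach that date, and the first interval stops at 01/25/21. If every month that an interval touches should count, one option is to expand the intervals with pd.period_range instead. The following is a minimal sketch under that assumption; the names cap, long_df and wide are illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "Start Date": ["10/22/20", "10/12/20", "02/25/21"],
    "End Date": ["01/25/21", "04/30/21", None],
    "Name": ["John", "John", "John"],
    "Value": [100, 50, 20],
})

# Assumption: open-ended rows are capped at the end of 2021.
cap = pd.Timestamp("2021-12-31")
start = pd.to_datetime(df["Start Date"], format="%m/%d/%y")
stop = pd.to_datetime(df["End Date"], format="%m/%d/%y", errors="coerce").fillna(cap)

# One (Name, Month, Value) record for every calendar month an interval touches,
# including partial first and last months.
records = []
for name, value, s, e in zip(df["Name"], df["Value"], start, stop):
    for month in pd.period_range(start=s, end=e, freq="M"):
        records.append((name, month, value))
long_df = pd.DataFrame(records, columns=["Name", "Month", "Value"])

# Sum the values per Name and month, one column per month.
wide = long_df.pivot_table(index="Name", columns="Month", values="Value",
                           aggfunc="sum", fill_value=0)
wide.columns = [p.strftime("%b_%y") for p in wide.columns]
print(wide)
```

With the sample rows this should give 150 for Oct_20 through Jan_21, 70 for Feb_21 through Apr_21, and 20 from May_21 onwards.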
I have data with 3 columns: date, id, sales.
My first task is filtering sales above 100, which I have done.
The second task is grouping each id by consecutive days.
index  date        id  sales
0      01/01/2018  03  101
1      01/01/2018  07  178
2      02/01/2018  03  120
3      03/01/2018  03  150
4      05/01/2018  07  205
the result should be:
index  id  count
0      03  3
1      07  1
2      07  1
I need to do this task without using pandas/dataframe, but right now I can't see how to approach the problem.
For what it's worth, I tried the solution suggested in "count consecutive days python dataframe",
but the ids are not grouped.
Here is my code:
data = df[df['sales'] >= 100]
data['date'] = pd.to_datetime(data['date']).dt.date
s = data.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = data.groupby(['id', s]).size().reset_index(level=0, drop=True)
It is very important that new_frame has a count column, because afterwards I need to count ids by ranges of those day counts (e.g. the number of ids with runs of 0-7 days, 7-12 days, etc.), but that is not part of my question.
Thanks a lot.
Your code is close, but it needs some fine-tuning, as follows:
data = df[df['sales'] >= 100].copy()  # copy, so the assignment below does not write into a slice of df
data['date'] = pd.to_datetime(data['date'], dayfirst=True)
df2 = data.sort_values(['id', 'date'])
s = df2.groupby('id').date.diff().dt.days.ne(1).cumsum()
new_frame = df2.groupby(['id', s]).size().reset_index(level=1, drop=True).reset_index(name='count')
Result:
print(new_frame)
id count
0 3 3
1 7 1
2 7 1
Summary of changes:
Your dates are in dd/mm/yyyy format rather than the default mm/dd/yyyy, so pd.to_datetime() needs the parameter dayfirst=True. Otherwise 02/01/2018 is parsed as 2018-02-01 instead of 2018-01-02, and the day difference between adjacent entries comes out at around 30 instead of 1.
A sort by the columns id and date was added so that the rows of each id are in chronological order before diff() is taken to build the series s.
In the last groupby(), reset_index(level=0, drop=True) should drop level=1 instead, because level=0 is the id field, which we want to keep.
The extra .reset_index(name='count') after the last groupby() turns the pandas Series back into a dataframe and names the new column count.
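Since the question also mentions doing this without pandas, here is a minimal standard-library sketch of the same idea: sort the dates per id, then start a new run whenever the gap to the previous date is not exactly one day. The rows list and the variable names are illustrative, and the dates are assumed to be dd/mm/yyyy as in the sample.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sample rows from the question: (date, id, sales).
rows = [
    ("01/01/2018", "03", 101),
    ("01/01/2018", "07", 178),
    ("02/01/2018", "03", 120),
    ("03/01/2018", "03", 150),
    ("05/01/2018", "07", 205),
]

# Keep only rows with sales >= 100 and collect the parsed dates per id.
dates_by_id = defaultdict(list)
for date_str, id_, sales in rows:
    if sales >= 100:
        dates_by_id[id_].append(datetime.strptime(date_str, "%d/%m/%Y").date())

# One (id, count) pair per run of consecutive days.
result = []
for id_ in sorted(dates_by_id):
    run_length = 0
    previous = None
    for day in sorted(dates_by_id[id_]):
        if previous is not None and day - previous == timedelta(days=1):
            run_length += 1            # the run continues
        else:
            if run_length:
                result.append((id_, run_length))  # close the previous run
            run_length = 1             # start a new run
        previous = day
    result.append((id_, run_length))   # close the last run of this id

print(result)
```

As in the pandas version, any gap other than exactly one day (including a repeated date) starts a new run.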
I have merged two dataframes with multiple overlapping columns. I would like to put the columns side by side.
merge = df1.merge(df2)
For example, Current Output:
YEAR_x,DATE_x,MAX_x,MIN_x,YEAR_y,DATE_y,MAX_y,MIN_y
I want the output to be:
YEAR, YEAR_auto, DATE, DATE_auto, MAX, MAX_auto, MIN, MIN_auto
I have more than 150 columns so I don't want to do it manually. How could I do that?
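Use merge with the suffixes parameter, so that df1's columns keep their names and the overlapping columns from df2 get _auto appended. Columns used as the join key are never suffixed, and the question does not show the key, so the line below assumes the two frames are aligned on their index (pass on=... instead if you have a key column):
merge = df1.merge(df2[df2.columns.intersection(df1.columns)], left_index=True, right_index=True, suffixes=('', '_auto'))
To order the columns as in df1, with each _auto column next to its original:
cols = sorted(merge.columns, key=lambda x: df1.columns.get_loc(x.removesuffix('_auto')))  # str.removesuffix needs Python 3.9+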
Example:
>>> merge
YEAR DATE MAX MIN YEAR_auto DATE_auto MAX_auto MIN_auto
0 2021 2021-08-06 100 0 2020 2020-08-06 50 20
>>> merge[cols]
YEAR YEAR_auto DATE DATE_auto MAX MAX_auto MIN MIN_auto
0 2021 2020 2021-08-06 2020-08-06 100 50 0 20
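For reference, here is a self-contained sketch that reproduces the example above. It assumes the two frames are aligned on their index (the question does not show the join key), and the helper base_name is hypothetical, introduced only so that column names containing underscores sort correctly.

```python
import pandas as pd

df1 = pd.DataFrame({"YEAR": [2021], "DATE": ["2021-08-06"], "MAX": [100], "MIN": [0]})
df2 = pd.DataFrame({"YEAR": [2020], "DATE": ["2020-08-06"], "MAX": [50], "MIN": [20]})

# Columns present in both frames, in df1's order.
common = [c for c in df1.columns if c in df2.columns]

# Assumption: rows are matched by index; df1 keeps its names, df2 gets "_auto".
merged = df1.merge(df2[common], left_index=True, right_index=True,
                   suffixes=("", "_auto"))

def base_name(col):
    """Strip the "_auto" suffix, if present (hypothetical helper)."""
    return col[:-len("_auto")] if col.endswith("_auto") else col

# Sort by the position of the base column in df1, original before "_auto".
cols = sorted(merged.columns,
              key=lambda c: (df1.columns.get_loc(base_name(c)), c.endswith("_auto")))
result = merged[cols]
print(result)
```

Stripping only the trailing _auto, rather than splitting on the first underscore, keeps the ordering correct for names such as MAX_TEMP.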
I have a data frame with the date as an index and a parameter. I want to convert column data into a new data frame with year as row index and week number as column name and cells showing weekly mean value. I would then use this information to plot using seaborn https://seaborn.pydata.org/generated/seaborn.relplot.html.
My data:
df =
data
2019-01-03 10
2019-01-04 20
2019-05-21 30
2019-05-22 40
2020-10-15 50
2020-10-16 60
2021-04-04 70
2021-04-05 80
My code:
# convert the df into weekly averaged dataframe
wdf = df.groupby(df.index.dt.strftime('%Y-%W')).data.mean()
wdf
2019-01 15
2019-26 35
2020-45 55
2021-20 75
Expected answer: Column name denotes the week number, index denotes the year. Cell denotes the sample's mean in that week.
01 20 26 45
2019 15 NaN 35 NaN # 15 is mean of 1st week (10,20) in above df
2020 NaN NaN NaN 55
2021 NaN 75 NaN NaN
No idea on how to proceed further to get the expected answer from the above-obtained solution.
You can use a pivot_table. The dates are in the index here, so the year and week are derived from df.index:
df['year'] = df.index.year
df['week'] = df.index.isocalendar().week  # ISO week number; DatetimeIndex.week was removed in pandas 2.0
final_table = pd.pivot_table(data=df, index='year', columns='week', values='data', aggfunc='mean')
You need to group by two keys, year and week, and then unstack the week level to lay out the data as a grid:
df.groupby([df.index.year, df.index.isocalendar().week])['data'].mean().unstack()
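To make the year-by-week grid concrete, here is a runnable sketch on the sample data from the question. It assumes ISO week numbering (and the matching ISO year) via DatetimeIndex.isocalendar(), which differs from the %W numbering used in the question's attempt; the names keyed and grid are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"data": [10, 20, 30, 40, 50, 60, 70, 80]},
    index=pd.to_datetime([
        "2019-01-03", "2019-01-04", "2019-05-21", "2019-05-22",
        "2020-10-15", "2020-10-16", "2021-04-04", "2021-04-05",
    ]),
)

# ISO year and ISO week for every row; using both keeps them consistent
# around New Year, where the ISO year can differ from the calendar year.
iso = df.index.isocalendar()
keyed = df.assign(year=iso["year"].astype(int), week=iso["week"].astype(int))

# Rows are years, columns are week numbers, cells are weekly means.
grid = keyed.pivot_table(index="year", columns="week", values="data", aggfunc="mean")
print(grid)
```

Weeks with no samples in a given year show up as NaN, which seaborn simply skips when plotting.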
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple Pandas method?
I don't see how I can do this with something like a groupby?
Or would I have to do something like iterrows, find all the monthly entries, order them by date and pick the last one?
Thanks.
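Use Grouper with GroupBy.last to take the last entry of each month, forward-fill the months that have no entries with ffill, and turn the result back into a dataframe with Series.reset_index:
# if necessary
# df['date'] = pd.to_datetime(df['date'])
# 'M' is the month-end alias; pandas 2.2 and later spell it 'ME'
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
# alternative
# df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()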
print (df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
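The spelling of the month-end frequency alias has changed across pandas releases ('M', and 'ME' from pandas 2.2). A variant that does not depend on that alias is to group by monthly periods and reindex over the full range of months. This is a sketch of that alternative rather than the method above; the names monthly, full and month_end are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2009-04-23", "2009-04-24", "2009-04-27", "2009-04-30",
        "2009-05-05", "2009-05-20", "2009-05-28", "2009-05-29",
        "2009-06-01", "2009-06-02", "2009-06-12", "2009-08-03",
    ]),
    "totalShrs": [10000.0, 20000.0, 30000.0, 40000.0, 50000.0, 60000.0,
                  70000.0, 80000.0, 90000.0, 100000.0, 110000.0, 120000.0],
})

# Last balance of each month that has at least one entry.
df = df.sort_values("date")
monthly = df.groupby(df["date"].dt.to_period("M"))["totalShrs"].last()

# Add the months without entries (July here) and carry the balance forward.
full = pd.period_range(start=monthly.index.min(), end=monthly.index.max(), freq="M")
monthly = monthly.reindex(full).ffill()

# Optional: label each month by its last calendar day.
month_end = monthly.copy()
month_end.index = monthly.index.to_timestamp(how="end").normalize()
print(month_end)
```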
The following gives you the end-of-month values, though not in exactly the format you asked for. It expects the date column to hold strings and groups by month number only, so months without entries are skipped and the same month in different years would be merged:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values
for g in grouped:  # for each group, get last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with last row obtained
newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
I have a dataframe containing dates and prices. I need to sum all prices belonging to a week, e.g. 17/12 to 23/12, and put the total next to a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried using different datetime functions and groupby functions but was not able to get the output. Please help.
What about this approach?
In [19]: df.groupby(df['Date'].dt.isocalendar().week)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
UPDATE:
In [49]: df.resample('7D', on='Date').sum().rename_axis('week_from').reset_index()  # 7-day bins start at midnight of the first date
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
UPDATE2:
x = (df.resample('7D', on='Date')  # 7-day bins start at midnight of the first date
       .sum()
       .reset_index()
       .rename(columns={'Price': 'total'}))
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
                  + '-'
                  + (x.pop('Date') + pd.DateOffset(days=7)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
total week
0 100 17/12-24/12
1 50 24/12-31/12
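The labels above end on the first day of the following week (17/12-24/12), while the question asks for 17/12-23/12. A minimal sketch that produces the requested labels is shown below. It assumes the weeks should start on the first date in the data, which is what resample does for a '7D' rule with its default origin; the names weekly and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["12/17/2015", "12/18/2015", "12/19/2015", "12/21/2015", "12/24/2015"],
        format="%m/%d/%Y",
    ),
    "Price": [10, 20, 30, 40, 50],
})

# 7-day bins anchored at midnight of the first date (2015-12-17).
weekly = df.resample("7D", on="Date")["Price"].sum()

# Label each bin with its first and last day (start + 6 days).
labels = [
    f"{start:%d/%m}-{start + pd.Timedelta(days=6):%d/%m}"
    for start in weekly.index
]
out = pd.DataFrame({"week": labels, "total": weekly.to_numpy()})
print(out)
```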
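Using resample with calendar weeks ('W' bins end on Sunday):
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date').resample('W').sum()  # set_index('Date') moves Date out of the columns, so only Price is summed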
Price
Date
2015-12-20 60
2015-12-27 90