I have a three-month sales data set. I need the weekly sales count grouped by agent, and I'd also like the daily standard deviation per agent in a separate table.
Agent District Agent_type Date Device
12 abc br 01/02/2020 4233
12 abc br 01/02/2020 4123
12 abc br 03/02/2020 4314
12 abc br 05/02/2020 4134
12 abc br 19/02/2020 5341
12 abc br 19/02/2020 52141
12 abc br 19/02/2020 12141
12 abc br 26/02/2020 4224
12 abc br 28/02/2020 9563
12 abc br 05/03/2020 0953
12 abc br 10/03/2020 1212
12 abc br 15/03/2020 4309
12 abc br 02/03/2020 4200
12 abc br 30/03/2020 4299
12 abc br 01/04/2020 4211
12 abc br 10/04/2020 2200
12 abc br 19/04/2020 3300
12 abc br 29/04/2020 3222
12 abc br 29/04/2020 32222
12 abc br 29/04/2020 4212
12 abc br 29/04/2020 20922
12 abc br 29/04/2020 67822
13 aaa ae 15/02/2020 22222
13 aaa ae 29/02/2020 42132
13 aaa ae 10/02/2020 89022
13 aaa ae 28/02/2020 31111
13 aaa ae 28/02/2020 31132
13 aaa ae 28/02/2020 31867
13 aaa ae 14/02/2020 91122
Expected output:
Agent District Agent_type 1st_week_feb 2nd_week_feb 3rd_week_feb ..... 4th_week_apr
12 abc br count count count count
13 aaa ae count count count count
2nd output - daily std by agent
Agent tot_sale daily_std
12 22 2.40
13 7 1.34
You can use:
#convert values to datetimes
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
#get weeks starting from 1
week = df['Date'].dt.isocalendar().week
df['Week'] = (week - week.min() + 1)
#lowercase months
df['Month'] = df['Date'].dt.strftime('%b').str.lower()
print (df)
Agent Date Device Week Month
0 12 2020-02-01 4233 1 feb
1 12 2020-02-01 4123 1 feb
2 12 2020-02-03 4314 2 feb
3 12 2020-02-05 4134 2 feb
4 12 2020-02-19 5341 4 feb
5 12 2020-02-26 4224 5 feb
6 12 2020-02-28 9563 5 feb
7 12 2020-03-05 953 6 mar
8 12 2020-03-10 1212 7 mar
9 12 2020-03-15 4309 7 mar
10 12 2020-03-02 4200 6 mar
11 12 2020-03-30 4299 10 mar
12 12 2020-04-01 4211 10 apr
13 12 2020-04-10 2200 11 apr
14 12 2020-04-19 3300 12 apr
15 12 2020-04-29 3222 14 apr
16 13 2020-02-15 22222 3 feb
17 13 2020-02-29 42132 5 feb
18 13 2020-03-10 89022 7 mar
19 13 2020-03-28 31111 9 mar
20 13 2020-04-14 91122 12 apr
#if you need to count rows, use crosstab
df1 = pd.crosstab(df['Agent'], [df['Week'], df['Month']])
df1.columns = df1.columns.map(lambda x: f'{x[0]}_week_{x[1]}')
print (df1)
1_week_feb 2_week_feb 3_week_feb 4_week_feb 5_week_feb 6_week_mar \
Agent
12 2 2 0 1 2 2
13 0 0 1 0 1 0
7_week_mar 9_week_mar 10_week_apr 10_week_mar 11_week_apr \
Agent
12 2 0 1 1 1
13 1 1 0 0 0
12_week_apr 14_week_apr
Agent
12 1 1
13 1 0
#if you need to sum the Device column, use pivot_table
df2 = df.pivot_table(index='Agent',
columns=['Week', 'Month'],
values='Device',
aggfunc='sum',
fill_value=0)
df2.columns = df2.columns.map(lambda x: f'{x[0]}_week_{x[1]}')
print (df2)
1_week_feb 2_week_feb 3_week_feb 4_week_feb 5_week_feb 6_week_mar \
Agent
12 8356 8448 0 5341 13787 5153
13 0 0 22222 0 42132 0
7_week_mar 9_week_mar 10_week_apr 10_week_mar 11_week_apr \
Agent
12 5521 0 4211 4299 2200
13 89022 31111 0 0 0
12_week_apr 14_week_apr
Agent
12 3300 3222
13 91122 0
EDIT: Thanks to @Henry Yik for pointing out another way to count weeks, by day of month:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Week'] = (df["Date"].dt.day-1)//7+1
df['Month'] = df['Date'].dt.strftime('%b').str.lower()
print (df)
Agent Date Device Week Month
0 12 2020-02-01 4233 1 feb
1 12 2020-02-01 4123 1 feb
2 12 2020-02-03 4314 1 feb
3 12 2020-02-05 4134 1 feb
4 12 2020-02-19 5341 3 feb
5 12 2020-02-26 4224 4 feb
6 12 2020-02-28 9563 4 feb
7 12 2020-03-05 953 1 mar
8 12 2020-03-10 1212 2 mar
9 12 2020-03-15 4309 3 mar
10 12 2020-03-02 4200 1 mar
11 12 2020-03-30 4299 5 mar
12 12 2020-04-01 4211 1 apr
13 12 2020-04-10 2200 2 apr
14 12 2020-04-19 3300 3 apr
15 12 2020-04-29 3222 5 apr
16 13 2020-02-15 22222 3 feb
17 13 2020-02-29 42132 5 feb
18 13 2020-03-10 89022 2 mar
19 13 2020-03-28 31111 4 mar
20 13 2020-04-14 91122 2 apr
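The answers above cover the first table. For the second requested table (daily std by agent), here is a minimal sketch, assuming "daily std" means the standard deviation of the per-day sale counts:

```python
import pandas as pd

# toy data in the shape of the question
df = pd.DataFrame({
    'Agent': [12, 12, 12, 13, 13],
    'Date': pd.to_datetime(['2020-02-01', '2020-02-01', '2020-02-03',
                            '2020-02-15', '2020-02-29']),
})

# sales count per agent per day
daily = df.groupby(['Agent', 'Date']).size()

# total sales and std of the daily counts, one row per agent
out = daily.groupby(level='Agent').agg(tot_sale='sum',
                                       daily_std='std').reset_index()
print(out)
```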
Assuming that the Date column has been converted to datetime, you can do your
task with the following one-liner:
df.groupby(['Agent', pd.Grouper(key='Date', freq='W-MON', closed='left',
label='left')]).count().unstack(level=1, fill_value=0)
For your data sample the result is:
Device
Date 2020-01-27 2020-02-03 2020-02-17 2020-02-24 2020-03-02 2020-03-09 2020-03-30 2020-04-06 2020-04-13 2020-04-27 2020-02-10 2020-03-23
Agent
12 2 2 1 2 2 2 2 1 1 1 0 0
13 0 0 0 1 0 1 0 0 1 0 1 1
The column name is from the date of a Monday "opening" the week.
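To see the binning on a few rows of the sample data (a minimal sketch; the closed='left'/label='left' pair makes each week a [Monday, next Monday) interval labeled by its opening Monday):

```python
import pandas as pd

df = pd.DataFrame({
    'Agent': [12, 12, 12],
    'Date': pd.to_datetime(['2020-02-01', '2020-02-03', '2020-02-05']),
    'Device': [4233, 4314, 4134],
})

# Feb 1 falls in the week opened by Monday 2020-01-27;
# Feb 3 and Feb 5 fall in the week opened by Monday 2020-02-03
res = df.groupby(['Agent', pd.Grouper(key='Date', freq='W-MON',
                                      closed='left', label='left')]).count()
print(res)
```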
ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This is explode, then create a dataframe from the tuples, and then join:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(), index=s.index).add_prefix("TUP_"))
       .reset_index(drop=True))  # you can chain a dropna if required
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
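If you'd rather drop the IDs with no tuples up front, one way is to dropna before building the tuple frame and use an inner join (a sketch; note ID 2 disappears from the result):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'LIST_OF_TUPLE': [[('2012', '12'), ('2012', '33')], np.nan,
                      [('2014', '82')]],
})

# drop the NaN rows before building the tuple frame, then keep only matching IDs
s = df['LIST_OF_TUPLE'].explode().dropna()
out = (df[['ID']]
       .join(pd.DataFrame(s.tolist(), index=s.index).add_prefix('TUP_'),
             how='inner')
       .reset_index(drop=True))
print(out)
```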
So I have a data frame that is something like this
Resource 2020-06-01 2020-06-02 2020-06-03
Name1 8 7 8
Name2 7 9 9
Name3 10 10 10
Imagine that the header literally has all the days of the month, and that there are way more names than just three.
I need to reduce the columns to five: the first column covering the days from 2020-06-01 to 2020-06-05, then Saturday through Friday of each following week, or up to the last day of the month if it comes before Friday. So for June the weeks would be:
week 1: 2020-06-01 to 2020-06-05
week 2: 2020-06-06 to 2020-06-12
week 3: 2020-06-13 to 2020-06-19
week 4: 2020-06-20 to 2020-06-26
week 5: 2020-06-27 to 2020-06-30
I have no problem defining these weeks. The problem is grouping the columns based on them.
I couldn't come up with anything.
Does someone have any ideas about this?
I had to use the following code to generate your dataframe.
import numpy as np
import pandas as pd

dates = pd.date_range(start='2020-06-01', end='2020-06-30')
df = pd.DataFrame({
    'Name1': np.random.randint(1, 10, size=len(dates)),
    'Name2': np.random.randint(1, 10, size=len(dates)),
    'Name3': np.random.randint(1, 10, size=len(dates)),
})
df = df.set_index(dates).transpose().reset_index().rename(columns={'index': 'Resource'})
Then, the solution starts from here.
# Set the first column as index
df = df.set_index(df['Resource'])
# Remove the unused column
df = df.drop(columns=['Resource'])
# Transpose the dataframe
df = df.transpose()
# Output:
Resource Name1 Name2 Name3
2020-06-01 00:00:00 3 2 7
2020-06-02 00:00:00 5 6 8
2020-06-03 00:00:00 2 3 6
...
# Bring "Resource" from index to column
df = df.reset_index()
df = df.rename(columns={'index': 'Resource'})
# Add a "week of year" column (dt.weekofyear is deprecated; use isocalendar)
df['week_no'] = df['Resource'].dt.isocalendar().week
# You can simply group by the week no column
df.groupby('week_no').sum().reset_index()
# Output:
Resource week_no Name1 Name2 Name3
0 23 38 42 41
1 24 37 30 43
2 25 38 29 23
3 26 29 40 42
4 27 2 8 3
I don't know what you want to do for the next. If you want your original form, just transpose() it back.
EDIT: OP clarified that the week should start on Saturday and end on Friday
# 0: Monday
# 1: Tuesday
# 2: Wednesday
# 3: Thursday
# 4: Friday
# 5: Saturday
# 6: Sunday
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']
Output:
Resource Resource Name1 Name2 Name3 week_no weekday customised_weekno
0 2020-06-01 4 7 7 23 0 23
1 2020-06-02 8 6 7 23 0 23
2 2020-06-03 5 9 5 23 0 23
3 2020-06-04 7 6 5 23 0 23
4 2020-06-05 6 3 7 23 0 23
5 2020-06-06 3 7 6 23 1 24
6 2020-06-07 5 4 4 23 1 24
7 2020-06-08 8 1 5 24 0 24
8 2020-06-09 2 7 9 24 0 24
9 2020-06-10 4 2 7 24 0 24
10 2020-06-11 6 4 4 24 0 24
11 2020-06-12 9 5 7 24 0 24
12 2020-06-13 2 4 6 24 1 25
13 2020-06-14 6 7 5 24 1 25
14 2020-06-15 8 7 7 25 0 25
15 2020-06-16 4 3 3 25 0 25
16 2020-06-17 6 4 5 25 0 25
17 2020-06-18 6 8 2 25 0 25
18 2020-06-19 3 1 2 25 0 25
So, you can use customised_weekno for grouping.
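Putting the pieces together, a minimal sketch (one Name column with a constant value of 1, so each weekly sum simply equals the number of days in that custom week):

```python
import pandas as pd

dates = pd.date_range('2020-06-01', '2020-06-30')
df = pd.DataFrame({'Resource': dates, 'Name1': 1})

df['week_no'] = df['Resource'].dt.isocalendar().week
# push Saturday/Sunday into the following week
df['weekday'] = df['Resource'].dt.weekday.apply(lambda day: 0 if day <= 4 else 1)
df['customised_weekno'] = df['week_no'] + df['weekday']

weekly = df.groupby('customised_weekno')['Name1'].sum()
print(weekly)
```

The five groups line up with the five weeks the OP listed for June 2020 (5, 7, 7, 7, and 4 days).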
I have read a couple of similar posts regarding the issue, but none of the solutions worked for me. So I've got the following csv:
Score date term
0 72 3 Feb · 1
1 47 1 Feb · 1
2 119 6 Feb · 1
8 101 7 hrs · 1
9 536 11 min · 1
10 53 2 hrs · 1
11 20 11 Feb · 3
3 15 1 hrs · 2
4 33 7 Feb · 1
5 153 4 Feb · 3
6 34 3 min · 2
7 26 3 Feb · 3
I want to sort the csv by date. What's the easiest way to do that?
You can create two helper columns: one with datetimes created by to_datetime, and a second with timedeltas created by to_timedelta. to_timedelta needs the HH:MM:SS format, so the 'min'/'hrs' values are first converted with Series.replace using regexes. Then sort by both columns with DataFrame.sort_values:
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
df = df.sort_values(['times','date1'])
print (df)
Score date term date1 times
6 34 3 min 2 NaT 00:03:00
9 536 11 min 1 NaT 00:11:00
3 15 1 hrs 2 NaT 01:00:00
10 53 2 hrs 1 NaT 02:00:00
8 101 7 hrs 1 NaT 07:00:00
1 47 1 Feb 1 1900-02-01 NaT
0 72 3 Feb 1 1900-02-03 NaT
7 26 3 Feb 3 1900-02-03 NaT
5 153 4 Feb 3 1900-02-04 NaT
2 119 6 Feb 1 1900-02-06 NaT
4 33 7 Feb 1 1900-02-07 NaT
11 20 11 Feb 3 1900-02-11 NaT
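Once sorted, the helper columns can be dropped; a condensed sketch of the whole flow on a three-row sample:

```python
import pandas as pd

df = pd.DataFrame({'Score': [536, 15, 72],
                   'date': ['11 min', '1 hrs', '3 Feb']})

# dates like "3 Feb" parse; "min"/"hrs" rows become NaT here
df['date1'] = pd.to_datetime(df['date'], format='%d %b', errors='coerce')
# rewrite "11 min" -> "00:11:00" and "1 hrs" -> "1:00:00" for to_timedelta
times = df['date'].replace({r'(\d+)\s+min': r'00:\1:00',
                            r'\s+hrs': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')

# timedeltas (recent) first, then calendar dates; drop the helpers afterwards
df = df.sort_values(['times', 'date1']).drop(columns=['date1', 'times'])
print(df)
```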
Current df:
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018
.. ..
I have a df with ID and Date; ID is unique in the original df.
I would like to create a new df based on the date: each ID has a max Date, and I would like to take that date and go back 4 days (5 rows per ID).
There are thousands of IDs.
Expect to get:
ID Date
11 3/15/2018
11 3/16/2018
11 3/17/2018
11 3/18/2018
11 3/19/2018
22 1/1/2018
22 1/2/2018
22 1/3/2018
22 1/4/2018
22 1/5/2018
33 2/8/2018
33 2/9/2018
33 2/10/2018
33 2/11/2018
33 2/12/2018
… …
I tried the following method; I think pd.date_range might be the right direction, but I keep getting an error.
def date_list(row):
    dates = pd.date_range(row["Date"], periods=5)
    return dates

df["Date_list"] = df.apply(date_list, axis="columns")
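For what it's worth, the attempt above is close: generate the ranges ending at each Date (rather than starting there), keep them in the column, and flatten with DataFrame.explode (available from pandas 0.25). A sketch:

```python
import pandas as pd

df = pd.DataFrame({'ID': [11, 22],
                   'Date': pd.to_datetime(['3/19/2018', '1/5/2018'])})

# 5-day range ending at each row's Date, then one output row per date
df['Date'] = df['Date'].apply(lambda d: pd.date_range(end=d, periods=5))
out = df.explode('Date').reset_index(drop=True)
print(out)
```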
Here is another one, using df.assign to overwrite Date and pd.concat to glue the ranges together. cᴏʟᴅsᴘᴇᴇᴅ's solution wins on performance, but I think this might be a nice addition as it is quite easy to read and understand.
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
Alternative:
dates = (pd.date_range(*x) for x in zip(df['Date']-pd.Timedelta(days=4), df['Date']))
df = (pd.DataFrame(dict(zip(df['ID'],dates)))
.T
.stack()
.reset_index(0)
.rename(columns={'level_0': 'ID', 0: 'Date'}))
Full example:
import pandas as pd
data = '''\
ID Date
11 3/19/2018
22 1/5/2018
33 2/12/2018'''
# Recreate dataframe (pd.compat.StringIO was removed; use io.StringIO)
from io import StringIO
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['Date'] = pd.to_datetime(df.Date)
df = pd.concat([df.assign(Date=df.Date - pd.Timedelta(days=i)) for i in range(5)])
df.sort_values(by=['ID','Date'], ascending = [True,True], inplace=True)
print(df)
Returns:
ID Date
0 11 2018-03-15
0 11 2018-03-16
0 11 2018-03-17
0 11 2018-03-18
0 11 2018-03-19
1 22 2018-01-01
1 22 2018-01-02
1 22 2018-01-03
1 22 2018-01-04
1 22 2018-01-05
2 33 2018-02-08
2 33 2018-02-09
2 33 2018-02-10
2 33 2018-02-11
2 33 2018-02-12
reindexing with pd.date_range
Let's try creating a flat list of date-ranges and reindexing this DataFrame.
from itertools import chain
v = df.assign(Date=pd.to_datetime(df.Date)).set_index('Date')
# assuming ID is a string column
v.reindex(chain.from_iterable(
pd.date_range(end=i, periods=5) for i in v.index)
).bfill().reset_index()
         Date  ID
0  2018-03-15  11
1  2018-03-16  11
2  2018-03-17  11
3  2018-03-18  11
4  2018-03-19  11
5  2018-01-01  22
6  2018-01-02  22
7  2018-01-03  22
8  2018-01-04  22
9  2018-01-05  22
10 2018-02-08  33
11 2018-02-09  33
12 2018-02-10  33
13 2018-02-11  33
14 2018-02-12  33
concat based solution on keys
Just for fun. My reindex solution is definitely more performant and easier to read, so if you were to pick one, use that.
v = df.assign(Date=pd.to_datetime(df.Date))
v_dict = {
j : pd.DataFrame(
pd.date_range(end=i, periods=5), columns=['Date']
)
for j, i in zip(v.ID, v.Date)
}
(pd.concat(v_dict, axis=0)
.reset_index(level=1, drop=True)
.rename_axis('ID')
.reset_index()
)
    ID       Date
0   11 2018-03-15
1   11 2018-03-16
2   11 2018-03-17
3   11 2018-03-18
4   11 2018-03-19
5   22 2018-01-01
6   22 2018-01-02
7   22 2018-01-03
8   22 2018-01-04
9   22 2018-01-05
10  33 2018-02-08
11  33 2018-02-09
12  33 2018-02-10
13  33 2018-02-11
14  33 2018-02-12
Group by ID, select the column Date, and for each group generate a series of five days leading up to the greatest date.
Rather than writing a long lambda, I've written a helper function.
def drange(x):
e = x.max()
s = e-pd.Timedelta(days=4)
return pd.Series(pd.date_range(s,e))
res = df.groupby('ID').Date.apply(drange)
Then drop the extraneous level from the resulting MultiIndex and we get our desired output:
res.reset_index(level=0).reset_index(drop=True)
# outputs:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
5 22 2018-01-01
6 22 2018-01-02
7 22 2018-01-03
8 22 2018-01-04
9 22 2018-01-05
10 33 2018-02-08
11 33 2018-02-09
12 33 2018-02-10
13 33 2018-02-11
14 33 2018-02-12
Compact alternative
# Help function to return Serie with daterange
func = lambda x: pd.date_range(x.iloc[0]-pd.Timedelta(days=4), x.iloc[0]).to_series()
res = df.groupby('ID').Date.apply(func).reset_index().drop('level_1',1)
You can try groupby with date_range
df.groupby('ID').Date.apply(lambda x : pd.Series(pd.date_range(end=x.iloc[0],periods=5))).reset_index(level=0)
Out[793]:
ID Date
0 11 2018-03-15
1 11 2018-03-16
2 11 2018-03-17
3 11 2018-03-18
4 11 2018-03-19
0 22 2018-01-01
1 22 2018-01-02
2 22 2018-01-03
3 22 2018-01-04
4 22 2018-01-05
0 33 2018-02-08
1 33 2018-02-09
2 33 2018-02-10
3 33 2018-02-11
4 33 2018-02-12
I've got the following data frame:
IDENT YEAR MONTH DAY HOUR MIN XXXX YYYY GPS SNR
0 0 2015 5 13 5 0 20.45 16 0 44
1 0 2015 5 13 4 0 20.43 16 0 44
2 0 2015 5 13 3 0 20.42 16 0 44
3 0 2015 5 13 2 0 20.47 16 0 40
4 0 2015 5 13 1 0 20.50 16 0 44
5 0 2015 5 13 0 0 20.54 16 0 44
6 0 2015 5 12 23 0 20.56 16 0 40
It comes from a csv file and I've made the dataframe using Python Pandas.
Now I'd like to join the columns YEAR+MONTH+DAY+HOUR+MIN to make a new one, for example:
DATE-TIME
2015-5-13-5-0
How can I do that?
date_cols = ['YEAR','MONTH','DAY','HOUR','MIN']
df[date_cols] = df[date_cols].astype(str)
df['the_date'] = df[date_cols].apply(lambda x: '-'.join(x),axis=1)
Output:
IDENT YEAR MONTH DAY HOUR MIN XXXX YYYY GPS SNR the_date
0 0 2015 5 13 5 0 20.45 16 0 44 2015-5-13-5-0
1 0 2015 5 13 4 0 20.43 16 0 44 2015-5-13-4-0
2 0 2015 5 13 3 0 20.42 16 0 44 2015-5-13-3-0
3 0 2015 5 13 2 0 20.47 16 0 40 2015-5-13-2-0
4 0 2015 5 13 1 0 20.50 16 0 44 2015-5-13-1-0
5 0 2015 5 13 0 0 20.54 16 0 44 2015-5-13-0-0
6 0 2015 5 12 23 0 20.56 16 0 40 2015-5-12-23-0
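If you want a real datetime column rather than a hyphen-joined string, pd.to_datetime can assemble one directly from the component columns, once they are renamed to the names it expects (a sketch on a one-row sample):

```python
import pandas as pd

df = pd.DataFrame({'YEAR': [2015], 'MONTH': [5], 'DAY': [13],
                   'HOUR': [5], 'MIN': [0]})

# to_datetime accepts a DataFrame whose columns are named
# year/month/day/hour/minute (and optionally second etc.)
parts = df[['YEAR', 'MONTH', 'DAY', 'HOUR', 'MIN']].rename(
    columns={'YEAR': 'year', 'MONTH': 'month', 'DAY': 'day',
             'HOUR': 'hour', 'MIN': 'minute'})
df['DATE-TIME'] = pd.to_datetime(parts)
print(df)
```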
df.loc[:, 'DATE-TIME'] = df.apply(lambda x: "{0}-{1}-{2}-{3}-{4}"
.format(int(x.YEAR),
int(x.MONTH),
int(x.DAY),
int(x.HOUR),
int(x.MIN)),
axis=1)
>>> df
IDENT YEAR MONTH DAY HOUR MIN XXXX YYYY GPS SNR DATE-TIME
0 0 2015 5 13 5 0 20.45 16 0 44 2015-5-13-5-0
1 0 2015 5 13 4 0 20.43 16 0 44 2015-5-13-4-0
2 0 2015 5 13 3 0 20.42 16 0 44 2015-5-13-3-0
3 0 2015 5 13 2 0 20.47 16 0 40 2015-5-13-2-0
4 0 2015 5 13 1 0 20.50 16 0 44 2015-5-13-1-0
5 0 2015 5 13 0 0 20.54 16 0 44 2015-5-13-0-0
6 0 2015 5 12 23 0 20.56 16 0 40 2015-5-12-23-0