Please suggest a more suitable title for this question.
I have a two-level indexed DF (created via groupby):
clicks yield
country report_date
AD 2016-08-06 1 31
2016-12-01 1 0
AE 2016-10-11 1 0
2016-10-13 2 0
I need:
Sequentially take the data country by country, process it, and put it back:
for country in set(DF.get_level_values(0)):
DF_country = process(DF.loc[country])
DF[country] = DF_country
Where process() adds new rows to DF_country.
The problem is in the last line:
ValueError: Wrong number of items passed 2, placement implies 1
I just modified your code, changing process to add. Based on my understanding, process is a self-defined function, right?
for country in set(DF.index.get_level_values(0)): # change here
DF_country = DF.loc[country].add(1)
DF.loc[country] = DF_country.values #and here
DF
Out[886]:
clicks yield
country report_date
AD 2016-08-06 2 32
2016-12-01 2 1
AE 2016-10-11 2 1
2016-10-13 3 1
EDIT: since process() adds new rows, the shapes no longer match for direct assignment, so rebuild the frame with pd.concat instead:
l=[]
for country in set(DF.index.get_level_values(0)):
DF1=DF.loc[country]
DF1.loc['2016-01-01']=[1,2] #adding row here
l.append(DF1)
pd.concat(l,axis=0,keys=set(DF.index.get_level_values(0)))
Out[923]:
clicks yield
report_date
AE 2016-10-11 1 0
2016-10-13 2 0
2016-01-01 1 2
AD 2016-08-06 1 31
2016-12-01 1 0
2016-01-01 1 2
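As a side note, iterating over set(...) makes the block order non-deterministic from run to run. A minimal variant of the same idea (a sketch, assuming the DF from the question) that keeps the original country order and restores the index level names:
import pandas as pd

countries = DF.index.get_level_values(0).unique()  # preserves the original order
parts = []
for country in countries:
    part = DF.loc[country].copy()
    part.loc['2016-01-01'] = [1, 2]  # example of adding a row
    parts.append(part)

result = pd.concat(parts, keys=countries, names=['country', 'report_date'])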
My apologies for the vague title; it's hard to put what I want into words.
I'm trying to build a filled line chart with the date on the x-axis and the total transactions over time on the y-axis.
My data
The object is a pandas dataframe.
date | symbol | type | qty | total
----------------------------------------------
2020-09-10 ABC Buy 5 10
2020-10-18 ABC Buy 2 20
2020-09-19 ABC Sell 3 15
2020-11-05 XYZ Buy 10 8
2020-12-03 XYZ Buy 10 9
2020-12-05 ABC Buy 2 5
What I want
date | symbol | type | qty | total | aggregate_total
------------------------------------------------------------
2020-09-10 ABC Buy 5 10 10
2020-10-18 ABC Buy 2 20 10+20 = 30
2020-09-19 ABC Sell 3 15 10+20-15 = 15
2020-11-05 XYZ Buy 10 8 8
2020-12-03 XYZ Buy 10 9 8+9 = 17
2020-12-05 ABC Buy 2 5 10+20-15+5 = 20
Where I am now
I'm working with two nested for loops: one iterating over the symbols, one iterating over each row. I store the temporary results in lists. I'm still unsure how I will add the results to the final dataframe. I could reorder the dataframe by symbol and date, then append the temp lists together and finally assign that combined list to a new column.
The code below is just the inner loop over the rows.
af = df.loc[df['symbol'] == 'ABC']
temp_agg_total = [0]   # seed with 0, removed again below
temp_agg_qty = [0]
for i in range(0, af.shape[0]):
    # print(af.iloc[0:i, [2, 4]])
    # if the type is a buy, add the row to the running aggregate
    if af.iloc[i, 2] == "Buy":
        temp_agg_total.append(temp_agg_total[i] + af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] + af.iloc[i, 3])
    else:
        temp_agg_total.append(temp_agg_total[i] - af.iloc[i, 4])
        temp_agg_qty.append(temp_agg_qty[i] - af.iloc[i, 3])
# remove the seed element (0) of each list
temp_agg_total.pop(0)
temp_agg_qty.pop(0)
af = af.assign(agg_total=temp_agg_total,
               agg_qty=temp_agg_qty)
My question
Is there a better way to do this in pandas or numpy? It feels really heavy for something relatively simple.
The presence of the Buy/Sell type of operation complicates things.
Regards
# negate the total of Sell rows
df.loc[df['type'] == 'Sell', 'total'] *= -1
# cumulative sum of the (signed) total within each symbol
df['aggregate_total'] = df.groupby('symbol')['total'].cumsum()
Is this what you're looking for?
df['Agg'] = 1
df.loc[df['type'] == 'Sell', 'Agg'] = -1
df['Agg'] = df['Agg']*df['total']
df['Agg'].cumsum()
df["Type_num"] = df["type"].map({"Buy":1,"Sell":-1})
df["Num"] = df.Type_num*df.total
df.groupby(["symbol"],as_index=False)["Num"].cumsum()
pd.concat([df,df.groupby(["symbol"],as_index=False)["Num"].cumsum()],axis=1)
date symbol type qty total Type_num Num CumNum
0 2020-09-10 ABC Buy 5 10 1 10 10
1 2020-10-18 ABC Buy 2 20 1 20 30
2 2020-09-19 ABC Sell 3 15 -1 -15 15
3 2020-11-05 XYZ Buy 10 8 1 8 8
4 2020-12-03 XYZ Buy 10 9 1 9 17
5 2020-12-05 ABC Buy 2 5 1 5 20
The most important thing here is the cumulative sum. The groupby makes sure the cumulative sum is performed separately for each symbol. The renaming and dropping of columns should be easy for you.
The trick is that I mapped {Buy, Sell} to {1, -1}.
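For reference, a self-contained sketch of that map-and-cumsum idea on the sample data from the question (the signed helper Series is my own naming, not from the original answers):
import pandas as pd

df = pd.DataFrame({
    'date':   ['2020-09-10', '2020-10-18', '2020-09-19',
               '2020-11-05', '2020-12-03', '2020-12-05'],
    'symbol': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'ABC'],
    'type':   ['Buy', 'Buy', 'Sell', 'Buy', 'Buy', 'Buy'],
    'qty':    [5, 2, 3, 10, 10, 2],
    'total':  [10, 20, 15, 8, 9, 5],
})

# sign each transaction: Buy adds, Sell subtracts
signed = df['total'] * df['type'].map({'Buy': 1, 'Sell': -1})

# running total per symbol, in row order
df['aggregate_total'] = signed.groupby(df['symbol']).cumsum()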
Can I pass a list to a pandas Series as an index?
I have the following dataframe:
d = {'no': ['1','2','3','4','5','6','7','8','9'], 'buyer_code': ['Buy1', 'Buy2', 'Buy3', 'Buy1', 'Buy2', 'Buy2', 'Buy2', 'Buy1', 'Buy3'], 'dollar_amount': ['200.25', '350.00', '120.00', '400.50', '1231.25', '700.00', '350.00', '200.25', '2340.00'], 'date': ['22-01-2010','14-03-2010','17-06-2010','13-04-2011','17-05-2011','28-01-2012','23-07-2012','25-10-2012','25-12-2012']}
df = pd.DataFrame(data=d)
df
buyer_code date dollar_amount no
0 Buy1 22-01-2010 200.25 1
1 Buy2 14-03-2010 350.00 2
2 Buy3 17-06-2010 120.00 3
3 Buy1 13-04-2011 400.50 4
4 Buy2 17-05-2011 1231.25 5
5 Buy2 28-01-2012 700.00 6
6 Buy2 23-07-2012 350.00 7
7 Buy1 25-10-2012 200.25 8
8 Buy3 25-12-2012 2340.00 9
Converting to float for aggregation:
pd.options.display.float_format = '{:,.4f}'.format
df['dollar_amount'] = df['dollar_amount'].astype(float)
Getting the most important buyers by frequency and dollars:
NOTE: Here I am getting just the top 2 buyers; in the real example I might have to get up to 40 buyers.
xx = df.groupby('buyer_code').agg({'dollar_amount' : 'mean', 'no' : 'size'})
xx['frqAmnt'] = xx['no'].values * xx['dollar_amount'].values
xx = xx['frqAmnt'].nlargest(2)
xx
buyer_code
Buy2 2,631.2500
Buy3 2,460.0000
Name: frqAmnt, dtype: float64
Grouping buyers and their purchase dates:
zz = df.groupby(['buyer_code'])['date'].value_counts().groupby('buyer_code').head(all)
zz
buyer_code date
Buy1 2010-01-22 1
2011-04-13 1
2012-10-25 1
Buy2 2010-03-14 1
2011-05-17 1
2012-01-28 1
2012-07-23 1
Buy3 2010-06-17 1
2012-12-25 1
Name: date, dtype: int64
Now I want to pass my top buyer_codes to my zz series to get only the transactional data corresponding to those buyers.
How can I do it? I might be on the wrong path here, but kindly help me out.
I think you need:
a = zz[zz.index.get_level_values(0).isin(xx.index)]
print (a)
buyer_code date
Buy2 14-03-2010 1
17-05-2011 1
23-07-2012 1
28-01-2012 1
Buy3 17-06-2010 1
25-12-2012 1
Name: date, dtype: int64
To keep the same order as xx, a reindex is needed:
a = zz[zz.index.get_level_values(0).isin(xx.index)].reindex(xx.index, level=0)
And to collect all dates per buyer_code:
b = a.reset_index(name='a').groupby('buyer_code')['date'].apply(list).reset_index()
print (b)
buyer_code date
0 Buy2 [14-03-2010, 17-05-2011, 23-07-2012, 28-01-2012]
1 Buy3 [17-06-2010, 25-12-2012]
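As a side note (a sketch, assuming the zz and xx objects defined above), passing the list of labels directly to .loc also selects by the first index level and keeps xx's order:
a = zz.loc[list(xx.index)]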
I have a dataset about users taking online courses. It has features like 'id', 'event', and 'time'. I group by them and want to know how often a user performs each event on specific days. I want to count the events per day.
lt = log_train.groupby(['enrollment_id','event','time']).size()
print(lt)
enrollment_id event time
1 access 2014-06-14T09:38:39 2
2014-06-14T09:38:48 1
2014-06-19T06:21:16 2
2014-06-19T06:21:32 1
2014-06-19T06:21:45 1
..
200887 navigate 2014-07-24T03:27:16 1
page_close 2014-07-24T04:19:55 1
video 2014-07-24T04:19:57 1
200888 access 2014-07-24T03:48:14 2
discussion 2014-07-24T03:47:57 1
navigate 2014-07-24T03:47:17 1
2014-07-24T03:47:28 1
2014-07-24T03:48:01 1
From another dataset I have userIDs, courseIDs, and the course date range.
usercourse = pd.merge(enroll,date,how="left", on= 'course_id' )
enrollment_id username \
0 1 9Uee7oEuuMmgPx2IzPfFkWgkHZyPbWr0
1 3 1qXC7Fjbwp66GPQc6pHLfEuO8WKozxG4
2 4 FIHlppZyoq8muPbdVxS44gfvceX9zvU7
course_id from to
0 DPnLzkJJqOOPRJfBxIHbQEERiYHu5ila 2014-06-12 2014-07-11
1 7GRhBDsirIGkRZBtSMEzNTyDr2JQm4xx 2014-06-19 2014-07-18
2 DPnLzkJJqOOPRJfBxIHbQEERiYHu5ila 2014-06-12 2014-07-11
Every user has only one course and all the courses have the same length of 30 days. So what I want should look similar to this:
enrollment_id event #ofDays #ofActionTimes
1 access 2 2
10 6
30 2
..
200887 navigate 23 1
page_close 30 1
video 1 1
200888 access 12 2
discussion 2 1
navigate 5 3
29 4
#ofDays means the Nth day of the course.
#ofActionTimes means how often an event happens on the Nth day.
Since every course starts on a different date, I have no idea how to generate this data in Python.
I hope someone can help me solve the problem!
IIUC, you can use merge, groupby, and count to get what you want.
First, some example data. This is based on the data you provided, but I've modified it so that the output can be clearly traced from the starting data.
data1 = {"enrollment_id":[1,1,1,1,2,2,3,3,3],
"event":["access","access","access","navigate","access",
"page_close","navigate","navigate","video"],
"time":["2014-06-14T09:38:39", "2014-06-14T09:38:48",
"2014-06-19T06:21:16", "2014-06-19T06:21:32",
"2014-06-21T06:21:45", "2014-06-22T06:21:16",
"2014-06-19T06:21:32", "2014-06-20T06:21:16",
"2014-06-20T06:21:16"]}
data2 = {"enrollment_id":[1,2,3],
"username":["user1", "user2", "user3"],
"course_id":["course1", "course2", "course3"],
"course_from":["2014-06-12", "2014-06-19", "2014-06-12"],
"course_to":["2014-07-11", "2014-07-18", "2014-07-11"]}
df1 = pd.DataFrame(data1)
df1
enrollment_id event time
0 1 access 2014-06-14T09:38:39
1 1 access 2014-06-14T09:38:48
2 1 access 2014-06-19T06:21:16
3 1 navigate 2014-06-19T06:21:32
4 2 access 2014-06-21T06:21:45
5 2 page_close 2014-06-22T06:21:16
6 3 navigate 2014-06-19T06:21:32
7 3 navigate 2014-06-20T06:21:16
8 3 video 2014-06-20T06:21:16
df2 = pd.DataFrame(data2)
df2
course_id enrollment_id course_from course_to username
0 course1 1 2014-06-12 2014-07-11 user1
1 course2 2 2014-06-19 2014-07-18 user2
2 course3 3 2014-06-12 2014-07-11 user3
We want to know how many times a specific event happened for a specific enrollment_id, with a separate count for each day of the course.
Derive the course day number course_day_num by subtracting course_from (the course start date) from event_date.
df = (df1.merge(df2[["enrollment_id", "course_from"]],
on="enrollment_id", how="left")
)
df["event_date"] = pd.to_datetime(pd.to_datetime(df1.time).dt.date)
df["course_from"] = pd.to_datetime(df["course_from"])
df["course_day_num"] = (df.event_date - df["course_from"]).dt.days
Then groupby each course_day_num to get the event count, per person, per course day:
groupby_cols = ["enrollment_id", "event", "event_date", "course_day_num"]
df.groupby(groupby_cols).event_date.count()
enrollment_id event event_date course_day_num
1 access 2014-06-14 2 2
2014-06-19 7 1
navigate 2014-06-19 7 1
2 access 2014-06-21 2 1
page_close 2014-06-22 3 1
3 navigate 2014-06-19 7 1
2014-06-20 8 1
video 2014-06-20 8 1
Name: event_date, dtype: int64
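If only the counts keyed by day number are needed (the #ofDays / #ofActionTimes shape from the question), a sketch that drops event_date from the grouping of the df built above:
df.groupby(["enrollment_id", "event", "course_day_num"]).size()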
I am learning Python and at the moment I am playing with some sales data. The data is in CSV format and shows weekly sales.
I have the columns below with some sample data:
store# dept# dates weeklysales
1 1 01/01/2005 50000
1 1 08/01/2005 120000
1 1 15/01/2005 75000
1 1 22/01/2005 25000
1 1 29/01/2005 18000
1 2 01/01/2005 15000
1 2 08/01/2005 12000
1 2 15/01/2005 75000
1 2 22/01/2005 35000
1 2 29/01/2005 28000
1 1 01/02/2005 50000
1 1 08/02/2005 120000
1 1 15/02/2005 75000
1 1 22/03/2005 25000
1 1 29/03/2005 18000
I want to sum the weeklysales up to a monthly basis for each department and display the records.
I have tried to use the groupby function in pandas, following the links below:
how to convert monthly data to quarterly in pandas
Pandas group by and sum two columns
Pandas group-by and sum
But what happens with the above is that I get the sum of all the columns, producing the following output where the store and dept numbers are added up as well:
store# dept# dates weeklysales
4 3 01/2005 28800
4 1 01/2005 165000
4 3 02/2005 245000
4 3 03/2005 43000
I do not want to add up the store and dept numbers; I just want to sum the weeklysales figure for each month and display it like:
store# dept# dates weeklysales
1 1 01/2005 28800
1 2 01/2005 165000
1 1 02/2005 245000
1 1 03/2005 43000
I will be grateful for a solution to this.
Cheers,
Is this what you are after?
Convert dates to month/year format and then group and sum sales.
(df.assign(dates=df.dates.dt.strftime('%m/%Y'))
.groupby(['store#','dept#','dates'])
.sum()
.reset_index()
)
Out[243]:
store# dept# dates weeklysales
0 1 1 01/2005 288000
1 1 1 02/2005 245000
2 1 1 03/2005 43000
3 1 2 01/2005 165000
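Note that the .dt accessor only works on a datetime column; if dates was read from the CSV as plain strings, it may need to be parsed first. A minimal sketch, assuming the day-first dd/mm/yyyy format shown in the sample:
df['dates'] = pd.to_datetime(df['dates'], format='%d/%m/%Y')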
I have a DataFrame which looks like this:
date,time,metric_x
2016-02-27,00:00:28.0000000,31
2016-02-27,00:01:19.0000000,40
2016-02-27,00:02:55.0000000,39
2016-02-27,00:03:51.0000000,48
2016-02-27,00:05:22.0000000,42
2016-02-27,00:05:59.0000000,35
I wish to generate a new column
df['time_slot'] = df.apply(lambda row: time_slot_convert(pd.to_datetime(row['time'])), axis =1)
Where,
def time_slot_convert(time):
return time.hour + 1
This function finds the hour for the record, plus 1.
This is extremely slow. I understand that the data is read as a string. Is there a more efficient way to speed this up?
Faster is to remove apply and work on the whole column at once:
df['time_slot'] = pd.to_datetime(df['time']).dt.hour + 1
print (df)
date time metric_x time_slot
0 2016-02-27 00:00:28.0000000 31 1
1 2016-02-27 00:01:19.0000000 40 1
2 2016-02-27 00:02:55.0000000 39 1
3 2016-02-27 00:03:51.0000000 48 1
4 2016-02-27 00:05:22.0000000 42 1
5 2016-02-27 00:05:59.0000000 35 1
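To see the difference yourself, here is a rough comparison sketch (the sample column is made up, and timings depend on your data and machine):
import pandas as pd

df = pd.DataFrame({'time': ['00:00:28.0000000', '00:01:19.0000000'] * 5000})

%timeit df.apply(lambda row: pd.to_datetime(row['time']).hour + 1, axis=1)
%timeit pd.to_datetime(df['time']).dt.hour + 1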