I am working on a project where I was able to group by user and resample by 7 days, and now I want to access the grouped elements.
Here is the code:
group = df.set_index('date').groupby('user').resample('7D', convention='start', label='left')
group_result = pd.DataFrame({'Weekly_in_averge_amount': group.mean()['value'],
                             'Weekly_in_max_amount': group.max()['value'],
                             'Weekly_in_min_amount': group.min()['value'],
                             'Weekly_in_totalamount': group.sum()['value'],
                             'Weekly_in_degree': group.sum()['inputs'],
                             'monthdays': group.count()['month']})
groupUser = group_result.groupby('user').first()
I got this output
29 1.512015 ... 1.049153
30 34.896646 ... 26.350528
37 0.055000 ... 0.002245
38 0.835067 ... 0.102253
39 38.044883 ... 9.317114
40 1.476168 ... 0.090378
41 1.000000 ... 0.061224
42 8.976852 ... 0.183201
43 0.012000 ... 0.000490
44 2.377267 ... 0.048516
45 1.365204 ... 284.463992
For example, user 29 has transactions in more than one week. Is it possible to display the grouped weekly values for user 29, like this:
user date Weekly_in_averge_amount count
29 2011-05-25 1.512015 ... 34
29 2011-06-01 1.123298 ... 23
As you can see, user 29's rows are grouped by week. How can I get the rows grouped week by week? Note that 34 rows fall into the first group.
sorry if my explanation is not clear
Thank you for any help
Regards,
Khaled
You can use GroupBy.agg with a dictionary of column names and aggregate functions, then flatten the resulting MultiIndex columns by joining their levels, and finally rename:
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2017-04-03', periods=10)
df = pd.DataFrame({'date': rng,
'value': range(10),
'inputs': range(3,13),
'month': np.random.randint(1,7, size=10),
'user':['a'] * 3 + ['b'] *3 + ['c'] *4})
print (df)
date value inputs month user
0 2017-04-03 0 3 6 a
1 2017-04-04 1 4 3 a
2 2017-04-05 2 5 5 a
3 2017-04-06 3 6 3 b
4 2017-04-07 4 7 2 b
5 2017-04-08 5 8 4 b
6 2017-04-09 6 9 3 c
7 2017-04-10 7 10 4 c
8 2017-04-11 8 11 2 c
9 2017-04-12 9 12 2 c
df1 = (df.set_index('date')
.groupby('user')
.resample('7D', convention='start', label='left')
.agg({'value': ['mean','max','min','sum'],
'inputs':'sum',
'month':'count'}))
df1.columns = df1.columns.map('_'.join)
d = {'value_max':'Weekly_in_max_amount',
'value_min':'Weekly_in_min_amount',
'value_sum':'Weekly_in_totalamount',
'inputs_sum':'Weekly_in_degree',
'month_count':'monthdays',
'value_mean':'Weekly_in_averge_amount'}
df1 = df1.rename(columns=d).reset_index()
print (df1)
user date Weekly_in_averge_amount Weekly_in_max_amount \
0 a 2017-04-03 1.0 2
1 b 2017-04-06 4.0 5
2 c 2017-04-09 7.5 9
Weekly_in_min_amount Weekly_in_totalamount Weekly_in_degree monthdays
0 0 3 12 3
1 3 12 21 3
2 6 30 42 4
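To inspect the weekly rows of a single user (the "user 29" case in the question), you can simply filter the aggregated frame by user. A minimal sketch reusing the sample data from the answer, with a reduced agg dictionary for brevity:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
rng = pd.date_range('2017-04-03', periods=10)
df = pd.DataFrame({'date': rng,
                   'value': range(10),
                   'inputs': range(3, 13),
                   'month': np.random.randint(1, 7, size=10),
                   'user': ['a'] * 3 + ['b'] * 3 + ['c'] * 4})

# aggregate per user per 7-day bucket, then move the group keys into columns
df1 = (df.set_index('date')
         .groupby('user')
         .resample('7D')
         .agg({'value': 'mean', 'month': 'count'})
         .reset_index())

# each remaining row is one 7-day bucket for that user
print(df1[df1['user'] == 'a'])
```

Filtering after `reset_index` keeps one row per 7-day bucket, which is exactly the per-user weekly view the question asks for.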
Related
Here is the data:

id  date    population
1   2021-5  21
2   2021-5  22
3   2021-5  23
4   2021-5  24
1   2021-4  17
2   2021-4  24
3   2021-4  18
4   2021-4  29
1   2021-3  20
2   2021-3  29
3   2021-3  17
4   2021-3  22
I want to calculate the monthly change in population for each id, so the result will be:

id  date  delta
1   5     .2353
1   4     -.15
2   5     -.1519
2   4     -.2083
3   5     .2174
3   4     .0556
4   5     -.2083
4   4     .3182
delta := (this month - last month) / last month
How should I approach this in pandas? I'm thinking of groupby but don't know what to do next. Note there might be more dates, but the result always has this shape.
Use GroupBy.pct_change, sorting the values first; last, remove the missing rows by column delta:
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['id','date'], ascending=[True, False])
df['delta'] = df.groupby('id')['population'].pct_change(-1)
df = df.dropna(subset=['delta'])
print (df)
id date population delta
0 1 2021-05-01 21 0.235294
4 1 2021-04-01 17 -0.150000
1 2 2021-05-01 22 -0.083333
5 2 2021-04-01 24 -0.172414
2 3 2021-05-01 23 0.277778
6 3 2021-04-01 18 0.058824
3 4 2021-05-01 24 -0.172414
7 4 2021-04-01 29 0.318182
Try this (dividing by the older month's value, assuming rows are sorted newest-first within each id, as in the sample data):
df.groupby('id')['population'].rolling(2).apply(lambda x: (x.iloc[0] - x.iloc[1]) / x.iloc[1]).dropna()
maybe you could try something like:
data['delta'] = data['population'].diff()
data['delta'] /= data['population']
With this approach the first line would be NaN, but for the rest this should work.
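The diff idea above can be made to match the delta definition by computing it per id and dividing by the previous month's value (via shift). A minimal sketch on the question's data:

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id':   [1, 2, 3, 4] * 3,
    'date': pd.to_datetime(['2021-05-01'] * 4 + ['2021-04-01'] * 4 + ['2021-03-01'] * 4),
    'population': [21, 22, 23, 24, 17, 24, 18, 29, 20, 29, 17, 22],
})

# sort ascending so shift() looks at the previous month within each id
df = df.sort_values(['id', 'date'])
prev = df.groupby('id')['population'].shift()
df['delta'] = (df['population'] - prev) / prev
```

The first month of each id has no previous value and stays NaN; `dropna(subset=['delta'])` removes those rows if needed.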
I have two tables like this:
Customr Issue Date_Issue
1 1 01/01/2019
1 2 03/06/2019
1 3 04/07/2019
1 4 13/09/2019
2 5 01/02/2019
2 6 16/03/2019
2 7 20/08/2019
2 8 30/08/2019
2 9 01/09/2019
3 10 01/02/2019
3 11 03/02/2019
3 12 05/03/2019
3 13 20/04/2019
3 14 25/04/2019
3 15 13/05/2019
3 16 20/05/2019
3 17 25/05/2019
3 18 01/06/2019
3 19 03/07/2019
3 20 20/08/2019
Customr Date_Survey df_Score
1 06/04/2019 10
2 10/06/2019 9
3 01/08/2019 3
And I need to obtain the number of issues for each customer in the three months before the survey date.
But I cannot work out this query in pandas.
#first table
index_survey = [0,1,2]
Customer_Survey = pd.Series([1,2,3],index= index_survey)
Date_Survey = pd.Series(["06/04/2019","10/06/2019","01/08/2019"])
df_Score=[10, 9, 3]
df_survey = pd.DataFrame(Customer_Survey,columns = ["Customer_Survey"])
df_survey["Date_Survey"] =Date_Survey
df_survey["df_Score"] =df_Score
#And second table
Customr = pd.Series([1]*4 + [2]*5 + [3]*11)
Issue = pd.Series(range(1, 21))
Date_Issue = pd.Series(["01/01/2019", "03/06/2019", "04/07/2019", "13/09/2019",
                        "01/02/2019", "16/03/2019", "20/08/2019", "30/08/2019", "01/09/2019",
                        "01/02/2019", "03/02/2019", "05/03/2019", "20/04/2019", "25/04/2019",
                        "13/05/2019", "20/05/2019", "25/05/2019", "01/06/2019", "03/07/2019", "20/08/2019"])
df_issue = pd.DataFrame({"Customr": Customr, "Issue": Issue, "Date_Issue": Date_Issue})
I expect the result
Custr Date_Survey Score Count_issues
1 06/04/2019 10 0
2 10/06/2019 9 1
3 01/08/2019 3 5
Use:
#convert columns to datetimes
df1['Date_Issue'] = pd.to_datetime(df1['Date_Issue'], dayfirst=True)
df2['Date_Survey'] = pd.to_datetime(df2['Date_Survey'], dayfirst=True)
#create datetimes for 3 months before
df2['Date1'] = df2['Date_Survey'] - pd.offsets.DateOffset(months=3)
#merge together
df = df1.merge(df2, on='Customr')
#filter by between, select only Customr and get counts
s = df.loc[df['Date_Issue'].between(df['Date1'], df['Date_Survey']), 'Customr'].value_counts()
#map to new column and replace NaNs to 0
df2['Count_issues'] = df2['Customr'].map(s).fillna(0, downcast='int')
print (df2)
Customr Date_Survey df_Score Date1 Count_issues
0 1 2019-04-06 10 2019-01-06 0
1 2 2019-06-10 9 2019-03-10 1
2 3 2019-08-01 3 2019-05-01 5
Say we have a DataFrame df
df = pd.DataFrame({
"Id": [1, 2],
"Value": [2, 5]
})
df
Id Value
0 1 2
1 2 5
and some function f which takes an element of df and returns a DataFrame.
def f(value):
return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})
f(2)
A B
0 10 20
1 11 21
We want to apply f to each element in df["Value"], and join the result to df, like so:
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
3 2 5 11 21
4 2 5 12 22
5 2 5 13 23
6 2 5 14 24
In T-SQL, with a table df and table-valued function f, we would do this with a CROSS APPLY:
SELECT * FROM df
CROSS APPLY f(df.Value)
How can we do this in pandas?
You could apply the function to each element in Value in a list comprehension and use pd.concat to concatenate all resulting dataframes. Also assign the corresponding Id so that it can later be used to merge both dataframes:
l = pd.concat([f(row.Value).assign(Id=row.Id) for _, row in df.iterrows()])
df.merge(l, on='Id')
Id Value A B
0 1 2 10 20
1 1 2 11 21
2 2 5 10 20
3 2 5 11 21
4 2 5 12 22
5 2 5 13 23
6 2 5 14 24
One of the few cases where I would use DataFrame.iterrows. We can iterate over each row, concatenate your function's output with the original row, and at the same time fill the resulting NaN values with bfill and ffill:
df = pd.concat([pd.concat([f(r['Value']), pd.DataFrame(r).T], axis=1).bfill().ffill() for _, r in df.iterrows()],
ignore_index=True)
Which yields:
print(df)
A B Id Value
0 10 20 1.0 2.0
1 11 21 1.0 2.0
2 10 20 2.0 5.0
3 11 21 2.0 5.0
4 12 22 2.0 5.0
5 13 23 2.0 5.0
6 14 24 2.0 5.0
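A variant of the iterrows approaches above that avoids the float upcast visible in the output (the 1.0/2.0 values): repeat each input row as many times as f produced rows for it, then concatenate column-wise. A minimal sketch, not from the original answers:

```python
import pandas as pd

df = pd.DataFrame({"Id": [1, 2], "Value": [2, 5]})

def f(value):
    return pd.DataFrame({"A": range(10, 10 + value), "B": range(20, 20 + value)})

# apply f once per row
parts = [f(v) for v in df["Value"]]
# repeat each original row by the length of f's output for it
repeated = df.loc[df.index.repeat([len(p) for p in parts])].reset_index(drop=True)
# stack all f outputs and place them alongside the repeated rows
out = pd.concat([repeated, pd.concat(parts, ignore_index=True)], axis=1)
print(out)
```

Because no NaN ever appears, Id and Value keep their integer dtypes.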
I have a dataframe like this:
userId date new doa
67 23 2018-07-02 1 2
68 23 2018-07-03 1 3
69 23 2018-07-04 1 4
70 23 2018-07-06 1 6
71 23 2018-07-07 1 7
72 23 2018-07-10 1 10
73 23 2018-07-11 1 11
74 23 2018-07-13 1 13
75 23 2018-07-15 1 15
76 23 2018-07-16 1 16
77 23 2018-07-17 1 17
......
194605 448053 2018-08-11 1 11
194606 448054 2018-08-11 1 11
194607 448065 2018-08-11 1 11
df['doa'] stands for day of appearance.
Now I want to find out which unique userIds have appeared on a daily basis: which userIds appear on day 1, day 2, day 3, and so on. How exactly do I group them? I also want to find the average number of days per month that unique users open the app.
And finally I want to find out which users have appeared at least once every day throughout the month.
I want some thing like this:
userId week_no ndays
23 1 2
23 2 5
23 3 6
.....
1533 1 0
1534 2 1
1534 3 4
1534 4 1
1553 1 1
1553 2 0
1553 3 0
1553 4 0
And so on. ndays means the number of days in that week on which the user appeared.
You're asking several different questions, and none of them is particularly difficult; they just require a couple of groupbys and aggregation operations.
Setup
import pandas as pd

df = pd.DataFrame({
'userId': [1,1,1,1,1,2,2,2,2,3,3,3,3,3],
'date': ['2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-08-06',
'2018-07-02', '2018-07-03', '2018-08-04', '2018-08-05', '2018-07-02', '2018-07-03',
'2018-07-04', '2018-07-05', '2018-08-06']
})
df.date = pd.to_datetime(df.date)
df['doa'] = df.date.dt.day
userId date doa
0 1 2018-07-02 2
1 1 2018-07-03 3
2 1 2018-08-04 4
3 1 2018-08-05 5
4 1 2018-08-06 6
5 2 2018-07-02 2
6 2 2018-07-03 3
7 2 2018-08-04 4
8 2 2018-08-05 5
9 3 2018-07-02 2
10 3 2018-07-03 3
11 3 2018-07-04 4
12 3 2018-07-05 5
13 3 2018-08-06 6
Questions
How do I find the unique visitors per day?
You may use groupby and unique:
df.groupby([df.date.dt.month, 'doa']).userId.unique()
date doa
7 2 [1, 2, 3]
3 [1, 2, 3]
4 [3]
5 [3]
8 4 [1, 2]
5 [1, 2]
6 [1, 3]
Name: userId, dtype: object
How do I find the average number of days per month users open the app?
Using groupby and size:
df.groupby(['userId', df.date.dt.month]).size()
userId date
1 7 2
8 3
2 7 2
8 2
3 7 4
8 1
dtype: int64
This will give you the number of times per month each unique visitor has visited. If you want the average of this, simply apply mean:
df.groupby(['userId', df.date.dt.month]).size().groupby('date').mean()
date
7 2.666667
8 2.000000
dtype: float64
This one was a bit more unclear, but it seems that you want the number of days a user was seen per week:
You can groupby userId, as well as a variation on your date column to create continuous weeks, starting at the minimum date, then use size:
(df.groupby(
['userId', (df.date.dt.week.sub(df.date.dt.week.min())+1).rename('week_no')])
.size().reset_index(name='ndays')
)
userId week_no ndays
0 1 1 2
1 1 5 2
2 1 6 1
3 2 1 2
4 2 5 2
5 3 1 4
6 3 6 1
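The last part of the question (which users appeared at least once every day throughout the month) isn't covered above. One way is to count the distinct days each user was seen per month and compare that against the calendar length of the month. A minimal sketch on toy data (user 1 appears on all 31 days of July; the exact frame is an assumption, not the question's data):

```python
import pandas as pd

full = pd.date_range('2018-07-01', '2018-07-31')
df = pd.DataFrame({
    'userId': [1] * 31 + [2, 2],
    'date': list(full) + [pd.Timestamp('2018-07-01'), pd.Timestamp('2018-07-15')]
})

month = df.date.dt.to_period('M').rename('month')
# distinct days each user was seen, per month
days_seen = df.groupby(['userId', month])['date'].nunique()

# calendar length of each month in the index
month_len = days_seen.index.get_level_values('month').days_in_month

# users whose seen-day count equals the month length
everyday = (days_seen[days_seen.to_numpy() == month_len.to_numpy()]
            .index.get_level_values('userId').unique().tolist())
print(everyday)
```

Comparing against `days_in_month` rather than a hard-coded 30 or 31 keeps the check correct for February and leap years.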
I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list= [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong, and how can I get the needed result? Why "9 columns" if it is supposed to be 4?
Solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 1 7 NaN M
1 2 7 NaN M
2 3 7 DM F
3 4 7 DM M
4 5 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or, passing an Index explicitly:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
The output is the same as above.
never name a list as "list"
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
You can try:
df2 = df1[list] # it does a projection on the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]