Pandas: count some values in a column - python

I have a dataframe; here is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits. How can I count the days per month instead?

I think you need to group by ID, month and day, and aggregate with size:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month,df.used_at.dt.day ]).size()
print (df1)
ID                                used_at  used_at
574c4969b017ae6481db9a7c77328bc3  5        1          1
                                  6        1          1
                                  7        1          1
                                  8        1          1
                                           2          1
                                           3          1
                                           4          1
e990fae0f48b7daf52619b5ccbec61bc  5        1          2
                                           2          1
                                  6        1          3
dtype: int64
Or group by date, which is the same as grouping by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID                                used_at
574c4969b017ae6481db9a7c77328bc3  2015-05-01    1
                                  2015-06-01    1
                                  2015-07-01    1
                                  2015-08-01    1
                                  2015-08-02    1
                                  2015-08-03    1
                                  2015-08-04    1
e990fae0f48b7daf52619b5ccbec61bc  2015-05-01    2
                                  2015-05-02    1
                                  2015-06-01    3
dtype: int64
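The outputs above count rows per day. If the goal is the number of distinct days per month for each ID, one possible approach is to group by ID and month and count unique dates. This is only a sketch: the shortened IDs ('e990', '574c') and the names month and n_days are made up for illustration, and the timestamps are taken from the sample in the question.

```python
import pandas as pd

# Sample rows from the question; the IDs are shortened for readability (hypothetical labels).
df = pd.DataFrame({
    'ID': ['e990'] * 6 + ['574c'] * 7,
    'used_at': pd.to_datetime([
        '2015-05-01 09:29:11', '2015-05-01 09:33:00', '2015-06-01 09:33:07',
        '2015-06-01 09:34:30', '2015-06-01 09:36:22', '2015-05-02 09:38:40',
        '2015-05-01 11:04:48', '2015-06-01 12:02:27', '2015-07-01 12:06:35',
        '2015-08-01 12:23:26', '2015-08-02 12:24:52', '2015-08-03 12:24:52',
        '2015-08-04 12:27:39',
    ]),
})

# Drop the time of day so that several visits on one day collapse to one value.
day = df['used_at'].dt.normalize()
# A monthly period keeps year and month together (2015-05 differs from 2016-05).
month = df['used_at'].dt.to_period('M').rename('month')

# Number of distinct days per ID and month.
days_per_month = day.groupby([df['ID'], month]).nunique().rename('n_days')
print(days_per_month)
```

Grouping by a monthly period rather than dt.month avoids merging the same calendar month from different years.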
Differences between count and size:
size counts NaN values, count does not.
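A small sketch of that difference, using a made-up frame with one missing timestamp (the IDs 'a' and 'b' are placeholders):

```python
import pandas as pd

# Toy data: ID 'a' has one missing timestamp (NaT).
df = pd.DataFrame({
    'ID': ['a', 'a', 'b'],
    'used_at': pd.to_datetime(['2015-05-01', None, '2015-05-02']),
})

grouped = df.groupby('ID')['used_at']
sizes = grouped.size()    # number of rows per group, missing values included
counts = grouped.count()  # number of non-missing values per group

print(sizes)
print(counts)
```

For ID 'a', size reports 2 rows while count reports 1 non-missing value.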

Related

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know the number of columns above the date 2016-12-31 . Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
This selects the columns with 'Date' in the name, which is convenient when there are many such columns and you don't want to list them one by one. The selected columns are then compared with the lookup date, and the True values are summed per row.
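For reference, here is a self-contained sketch of the mask-and-sum idea on the sample data. It assumes the date columns arrive as strings and converts them first; the names date_cols and cutoff are chosen here for illustration.

```python
import pandas as pd

# Sample data from the question; dates are assumed to be stored as strings.
df = pd.DataFrame({
    'ID': [4, 6, 12],
    'Bill': [6, 8, 14],
    'Date 1': ['2000-10-04', '2016-05-03', '2016-11-16'],
    'Date 2': ['2000-11-05', '2017-08-09', '2017-05-04'],
    'Date 3': ['1999-12-05', '2018-07-14', '2017-07-04'],
    'Date 4': ['2001-05-04', '2015-09-12', '2018-07-04'],
    'Bill 2': [8, 17, 35],
})

# Pick every column whose name contains 'Date' and convert it to datetime.
date_cols = df.filter(like='Date').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)

# The boolean mask is True where a date is after the cutoff; summing along
# axis=1 counts the True values in each row.
cutoff = pd.Timestamp('2016-12-31')
df['Count'] = (df[date_cols] > cutoff).sum(axis=1)
print(df[['ID', 'Count']])
```

Converting to datetime up front means the comparison does not rely on the strings happening to sort correctly.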

Get consecutive occurrences of an event by group in pandas

I'm working with a DataFrame that has id, wage and date, like this:
id wage date
1 100 201212
1 100 201301
1 0 201302
1 0 201303
1 120 201304
1 0 201305
.
2 0 201302
2 0 201303
And I want to create a n_months_no_income column that counts how many consecutive months a given individual has got wage==0, like this:
id wage date n_months_no_income
1 100 201212 0
1 100 201301 0
1 0 201302 1
1 0 201303 2
1 120 201304 0
1 0 201305 1
. .
2 0 201302 1
2 0 201303 2
I feel it's some sort of mix between groupby('id') , cumcount(), maybe diff() or apply() and then a fillna(0), but I'm not finding the right one.
Do you have any ideas?
Here's an example for the dataframe for ease of replication:
df = pd.DataFrame({'id':[1,1,1,1,1,1,2,2],'wage':[100,100,0,0,120,0,0,0],
'date':[201212,201301,201302,201303,201304,201305,201302,201303]})
Edit: Added code for ease of use.
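In your case, use two groupby calls with cumcount, creating the additional key with cumsum. Note that with this approach a group that starts with wage == 0 (such as id 2 in the sample) is numbered from 0 rather than 1: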
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
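A variant that also handles an id whose history starts with a zero wage is sketched below. It is not the answer's exact method: instead of cumcount, it takes a cumulative sum of the "wage is zero" flags within each block, where a new block starts at every non-zero wage. The names block and is_zero are made up here.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2],
    'wage': [100, 100, 0, 0, 120, 0, 0, 0],
    'date': [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303],
})

# Within each id, every non-zero wage starts a new block.
block = df['wage'].ne(0).astype(int).groupby(df['id']).cumsum()

# Inside each (id, block) pair, a running sum of the zero-wage flags gives
# the length of the current streak; non-zero rows stay at 0.
is_zero = df['wage'].eq(0).astype(int)
df['n_months_no_income'] = is_zero.groupby([df['id'], block]).cumsum()
print(df)
```

This assumes the rows are already sorted by date within each id, as in the sample.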

Grouping by date range with pandas

I am looking to group by two columns: user_id and date; however, if the dates are close enough, I want to be able to consider the two entries part of the same group and group accordingly. Date is m-d-y
user_id date val
1 1-1-17 1
2 1-1-17 1
3 1-1-17 1
1 1-1-17 1
1 1-2-17 1
2 1-2-17 1
2 1-10-17 1
3 2-1-17 1
The grouping would group by user_id and dates +/- 3 days from each other. so the group by summing val would look like:
user_id date sum(val)
1 1-2-17 3
2 1-2-17 2
2 1-10-17 1
3 1-1-17 1
3 2-1-17 1
Is there a (somewhat) easy way to do this? I know there are some problematic aspects, for example what to do if the dates chain together endlessly, each three days apart, but the data I'm using only has 2 values per person.
Thanks!
I'd convert this to a datetime column and then use pd.TimeGrouper:
dates = pd.to_datetime(df.date, format='%m-%d-%y')
print(dates)
0 2017-01-01
1 2017-01-01
2 2017-01-01
3 2017-01-01
4 2017-01-02
5 2017-01-02
6 2017-01-10
7 2017-02-01
Name: date, dtype: datetime64[ns]
df = (df.assign(date=dates).set_index('date')
.groupby(['user_id', pd.TimeGrouper('3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Similar solution using pd.Grouper:
df = (df.assign(date=dates)
.groupby(['user_id', pd.Grouper(key='date', freq='3D')])
.sum()
.reset_index())
print(df)
user_id date val
0 1 2017-01-01 3
1 2 2017-01-01 2
2 2 2017-01-10 1
3 3 2017-01-01 1
4 3 2017-01-31 1
Update: TimeGrouper was deprecated in pandas 0.21 and removed in pandas 1.0, so Grouper is the one to use in this scenario (thanks for the heads up, Vaishali!).
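A self-contained version of the pd.Grouper approach is sketched below, with the sample frame rebuilt from the question. One caveat: freq='3D' cuts the timeline into fixed 3-day bins (as far as I can tell, counted from the earliest date in the column) rather than comparing each date with its neighbours, so two dates one day apart can still land in different bins.

```python
import pandas as pd

# Sample data from the question (dates are month-day-year).
df = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 1, 2, 2, 3],
    'date': ['1-1-17', '1-1-17', '1-1-17', '1-1-17',
             '1-2-17', '1-2-17', '1-10-17', '2-1-17'],
    'val': [1, 1, 1, 1, 1, 1, 1, 1],
})
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')

# Group by user and by fixed 3-day bins on the date column, then sum val.
out = (df.groupby(['user_id', pd.Grouper(key='date', freq='3D')])['val']
         .sum()
         .reset_index())
print(out)
```

The date shown for each group is the start of its bin, which is why 2017-02-01 is reported under 2017-01-31 in the outputs above.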
I came up with a very ugly solution, but it still works:
df=df.sort_values(['user_id','date'])
df['Key']=df.sort_values(['user_id','date']).groupby('user_id')['date'].diff().dt.days.lt(3).ne(True).cumsum()
df.groupby(['user_id','Key'],as_index=False).agg({'val':'sum','date':'first'})
Out[586]:
user_id Key val date
0 1 1 3 2017-01-01
1 2 2 2 2017-01-01
2 2 3 1 2017-01-10
3 3 4 1 2017-01-01
4 3 5 1 2017-02-01
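The same gap-based idea can be written out step by step. The sketch below assumes that a gap of more than 3 days to the previous entry of the same user starts a new group (the snippet above uses "less than 3 days" instead), and it reports the last date of each group to match the output requested in the question. The column name key and the helper names gap and new_group are made up.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 1, 1, 2, 2, 3],
    'date': ['1-1-17', '1-1-17', '1-1-17', '1-1-17',
             '1-2-17', '1-2-17', '1-10-17', '2-1-17'],
    'val': [1, 1, 1, 1, 1, 1, 1, 1],
})
df['date'] = pd.to_datetime(df['date'], format='%m-%d-%y')
df = df.sort_values(['user_id', 'date'])

# Gap to the previous entry of the same user (NaT for each user's first row).
gap = df.groupby('user_id')['date'].diff()

# Assumption: a first row, or a gap longer than 3 days, opens a new group.
new_group = gap.isna() | (gap > pd.Timedelta(days=3))
df['key'] = new_group.astype(int).groupby(df['user_id']).cumsum()

out = (df.groupby(['user_id', 'key'], as_index=False)
         .agg(date=('date', 'last'), val=('val', 'sum')))
print(out)
```

Unlike fixed bins, this chains entries together for as long as consecutive dates stay within the threshold, which is the "endless chain" behaviour the question mentions.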

Merge DateTimeIndex with month-year

I have two data frames. One has a precise (daily) DateTimeIndex. I have used that index to create monthly statistics using groupby(['groupid', pd.TimeGrouper('1M', closed='left', label='left')]).
Now I would like to merge the information back to the original data frame. However, the date-time labels of the collapsed data frame do of course not correspond exactly to the original DateTimeIndex. So then I'd like to match them to the corresponding month-year information.
How would I do that?
statistics
date groupid
2001-01-31 1 10
2001-02-31 1 11
and original data frame
date groupid foo
2001-01-25 1 1
2001-01-28 1 2
2001-02-02 1 4
With expected output
date groupid foo statistics
2001-01-25 1 1 10
2001-01-28 1 2 10
2001-02-02 1 4 11
You can create new columns holding the month period with to_period and then merge on them. It is also necessary to change 2001-02-31 to 2001-02-28 in df1, because 31 February does not exist:
df1['per'] = df1.index.get_level_values('date').to_period('M')
df2['per'] = df2.date.dt.to_period('M')
print (df1)
statistics per
date groupid
2001-01-31 1 10 2001-01
2001-02-28 1 11 2001-02
print (df2)
date groupid foo per
0 2001-01-25 1 1 2001-01
1 2001-01-28 1 2 2001-01
2 2001-02-02 1 4 2001-02
print (pd.merge(df2, df1.reset_index(level=1), on=['per','groupid'], how='right')
.drop('per', axis=1))
date groupid foo statistics
0 2001-01-25 1 1 10
1 2001-01-28 1 2 10
2 2001-02-02 1 4 11
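A runnable sketch of the same idea with both frames built from the sample data. The variable names stats, orig and merged are placeholders, and a left merge is used here so that every row of the original frame is kept.

```python
import pandas as pd

# Monthly statistics indexed by (date, groupid), with the corrected February date.
stats = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-31', '2001-02-28']),
    'groupid': [1, 1],
    'statistics': [10, 11],
}).set_index(['date', 'groupid'])

# Original daily data.
orig = pd.DataFrame({
    'date': pd.to_datetime(['2001-01-25', '2001-01-28', '2001-02-02']),
    'groupid': [1, 1, 1],
    'foo': [1, 2, 4],
})

# Add a month-period key to both frames.
stats_flat = stats.reset_index()
stats_flat['per'] = stats_flat['date'].dt.to_period('M')
orig['per'] = orig['date'].dt.to_period('M')

# Merge on month and group, then drop the helper key.
merged = (orig.merge(stats_flat[['per', 'groupid', 'statistics']],
                     on=['per', 'groupid'], how='left')
              .drop(columns='per'))
print(merged)
```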

Getting time difference per unique row items using pandas

Can someone please show me how to use pandas to get time difference per unique rows in the following data (df):
Round Order Date
1 1 2011.02.04 00:20:21
1 2 2011.02.04 00:25:11
1 3 2011.02.04 00:35:10
1 4 2011.02.04 00:47:10
2 1 2011.02.04 00:21:21
2 2 2011.02.04 00:31:11
2 3 2011.02.04 00:41:10
Because of the sequential order in column 'Order', the time difference will be the date value in row 4 minus the date value in row 1. So I want to arrive at this table (time_df):
Round TimeDiff
1 26.39
2 19.39
You can use groupby and take the difference between max and min:
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m.%d %H:%M:%S')
print(df)
Round Order Date
0 1 1 2011-02-04 00:20:21
1 1 2 2011-02-04 00:25:11
2 1 3 2011-02-04 00:35:10
3 1 4 2011-02-04 00:47:10
4 2 1 2011-02-04 00:21:21
5 2 2 2011-02-04 00:31:11
6 2 3 2011-02-04 00:41:10
print(df.groupby('Round')['Date'].apply(lambda x: x.max() - x.min()))
Round
1 00:26:49
2 00:19:49
Name: Date, dtype: timedelta64[ns]
I would do it this way:
In [324]: df
Out[324]:
Round Order Date
0 1 1 2011-02-04 00:20:21
1 1 2 2011-02-04 00:25:11
2 1 3 2011-02-04 00:35:10
3 1 4 2011-02-04 00:47:10
4 2 1 2011-02-04 00:21:21
5 2 2 2011-02-04 00:31:11
6 2 3 2011-02-04 00:41:10
In [325]: grp = df.groupby('Round')
In [327]: grp.Date.max()-grp.Date.min()
Out[327]:
Round
1 00:26:49
2 00:19:49
Name: Date, dtype: timedelta64[ns]
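To end up with a time_df table holding a numeric TimeDiff column, the timedelta can be converted to a number. The sketch below expresses it in decimal minutes, which is an assumption about the desired unit; the name span is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Round': [1, 1, 1, 1, 2, 2, 2],
    'Order': [1, 2, 3, 4, 1, 2, 3],
    'Date': ['2011.02.04 00:20:21', '2011.02.04 00:25:11',
             '2011.02.04 00:35:10', '2011.02.04 00:47:10',
             '2011.02.04 00:21:21', '2011.02.04 00:31:11',
             '2011.02.04 00:41:10'],
})
df['Date'] = pd.to_datetime(df['Date'], format='%Y.%m.%d %H:%M:%S')

# Latest minus earliest timestamp per Round.
grp = df.groupby('Round')['Date']
span = grp.max() - grp.min()

# Convert the timedelta to decimal minutes, rounded to two places.
time_df = (span.dt.total_seconds().div(60).round(2)
               .rename('TimeDiff')
               .reset_index())
print(time_df)
```

Using total_seconds() avoids the pitfall of the .seconds attribute, which ignores the days component of longer spans.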
