Merge DateTimeIndex with month-year - python

I have two data frames. One has a precise (daily) DateTimeIndex. I have used that index to create monthly statistics using groupby(['groupid', pd.TimeGrouper('1M', closed='left', label='left')]).
Now I would like to merge the information back to the original data frame. However, the date-time labels of the collapsed data frame of course do not correspond exactly to the original DateTimeIndex, so I'd like to match them on the corresponding month-year information.
How would I do that?
statistics
date groupid
2001-01-31 1 10
2001-02-31 1 11
and original data frame
date groupid foo
2001-01-25 1 1
2001-01-28 1 2
2001-02-02 1 4
With expected output
date groupid foo statistics
2001-01-25 1 1 10
2001-01-28 1 2 10
2001-02-02 1 4 11

You can create a new column holding the month period with to_period in both data frames and then merge on it. It is also necessary to change 2001-02-31 to 2001-02-28 in df1, because February 31 does not exist:
df1['per'] = df1.index.get_level_values('date').to_period('M')
df2['per'] = df2.date.dt.to_period('M')
print (df1)
statistics per
date groupid
2001-01-31 1 10 2001-01
2001-02-28 1 11 2001-02
print (df2)
date groupid foo per
0 2001-01-25 1 1 2001-01
1 2001-01-28 1 2 2001-01
2 2001-02-02 1 4 2001-02
print (pd.merge(df2, df1.reset_index(level=1), on=['per','groupid'], how='right')
.drop('per', axis=1))
date groupid foo statistics
0 2001-01-25 1 1 10
1 2001-01-28 1 2 10
2 2001-02-02 1 4 11
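For easier replication, here is a minimal self-contained sketch of the same idea. It assumes the statistics frame has date and groupid as ordinary columns (the frame name stats is made up for this example), and it uses how='left' so that every row of the original frame is kept even if a month has no statistics:

```python
import pandas as pd

# Original (daily) data, as in the question
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2001-01-25", "2001-01-28", "2001-02-02"]),
    "groupid": [1, 1, 1],
    "foo": [1, 2, 4],
})

# Monthly statistics (hypothetical flat layout; Feb 31 replaced by Feb 28)
stats = pd.DataFrame({
    "date": pd.to_datetime(["2001-01-31", "2001-02-28"]),
    "groupid": [1, 1],
    "statistics": [10, 11],
})

# Reduce both date columns to a month-year period and use it as the merge key
df2["per"] = df2["date"].dt.to_period("M")
stats["per"] = stats["date"].dt.to_period("M")

merged = (
    df2.merge(stats.drop(columns="date"), on=["per", "groupid"], how="left")
       .drop(columns="per")
)
print(merged)
```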

Related

Pandas group by date with subcategories and sums

I have a dataframe such as this one:
Date Category1 Cat2 Cat3 Cat4 Value
0 2021-02-02 4310 0 1 0 1082.00
1 2021-02-03 5121 2 0 0 -210.82
2 2021-02-03 4310 0 0 0 238.41
3 2021-02-12 5121 2 2 0 -1489.11
4 2021-02-25 6412 1 0 0 -30.97
5 2021-03-03 5121 1 1 0 -189.91
6 2021-03-09 6412 0 0 0 238.41
7 2021-03-13 5121 0 0 0 -743.08
Date column has been converted into datetime format, Value is a float, other columns are strings.
I am trying to group the dataframe by month and by each level of category, such as:
Level 1 = filter over category 1 and sum values for each category for each month:
Date Category1 Value
0 2021-02 4310 1320.41
1 2021-02 5121 -1699.93
2 2021-02 6412 -30.97
3 2021-03 5121 -932.99
4 2021-03 6412 238.41
Level 2 = filter over category 2 alone (one output dataframe) and over the concatenation of category 1 + 2 (another output dataframe):
Date Cat2 Value
0 2021-02 0 1320.41
1 2021-02 1 -30.97
2 2021-02 2 -1699.93
3 2021-03 0 -504.67
4 2021-03 1 -189.91
Second output :
Date Cat1+2 Value
0 2021-02 43100 1320.41
1 2021-02 51212 -1699.93
2 2021-02 64121 -30.97
3 2021-03 51210 -743.08
4 2021-03 51211 -189.91
5 2021-03 64120 238.41
Level 3 : filter over category 3 alone and over category 1+2+3
etc.
I am able to do one grouping at a time (by date or by category) but I can't combine them.
Grouping by date:
df.groupby(df["Date"].dt.year)
Grouping by category:
df.groupby('Category1')['Value'].sum()
You can try this.
To group by month, you can use dt.strftime. Note that '%B' yields only the month name, so the same month in different years ends up in one group; use '%Y-%m' if the year should be kept:
df.groupby(df['Date'].dt.strftime('%B'))['Value'].sum()
(See: How can I Group By Month from a Date field using Python/Pandas)
To group by multiple columns, pass a list:
df.groupby(['col5', 'col2'])
You could also create a month-year column and then group by that new column together with the category column.
See also: Pandas DataFrame Groupby two columns and get counts; Extracting just Month and Year separately from Pandas Datetime column
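Combining the two suggestions could look like the sketch below. It rebuilds the sample frame from the question (Cat3 and Cat4 are left out for brevity) and groups by a month period plus one category key; the names month, cat12, level1, level2 and level12 are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2021-02-02", "2021-02-03", "2021-02-03", "2021-02-12",
        "2021-02-25", "2021-03-03", "2021-03-09", "2021-03-13",
    ]),
    "Category1": ["4310", "5121", "4310", "5121", "6412", "5121", "6412", "5121"],
    "Cat2": ["0", "2", "0", "2", "1", "1", "0", "0"],
    "Value": [1082.00, -210.82, 238.41, -1489.11, -30.97, -189.91, 238.41, -743.08],
})

# Month-year key (keeps the year, unlike a bare month name)
month = df["Date"].dt.to_period("M").rename("Date")

# Level 1: month + Category1
level1 = df.groupby([month, "Category1"])["Value"].sum().reset_index()

# Level 2: month + Cat2 alone
level2 = df.groupby([month, "Cat2"])["Value"].sum().reset_index()

# Level 2, concatenated key: the categories are strings, so "+" concatenates
cat12 = (df["Category1"] + df["Cat2"]).rename("Cat1+2")
level12 = df.groupby([month, cat12])["Value"].sum().reset_index()

print(level1)
print(level2)
print(level12)
```

The same pattern extends to the deeper levels by concatenating more category columns into the key.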

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know the number of columns above the date 2016-12-31 . Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with a single column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in their name. This is convenient when there are many such columns and we don't want to list them one by one. Then we compare them with the lookup date and sum the True values per row.
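For replication, a self-contained sketch of the approaches above, assuming the Date columns have already been parsed to datetime (the variable name cutoff is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [4, 6, 12],
    "Bill": [6, 8, 14],
    "Date 1": pd.to_datetime(["2000-10-04", "2016-05-03", "2016-11-16"]),
    "Date 2": pd.to_datetime(["2000-11-05", "2017-08-09", "2017-05-04"]),
    "Date 3": pd.to_datetime(["1999-12-05", "2018-07-14", "2017-07-04"]),
    "Date 4": pd.to_datetime(["2001-05-04", "2015-09-12", "2018-07-04"]),
    "Bill 2": [8, 17, 35],
})

cutoff = pd.Timestamp("2016-12-31")

# Boolean mask over the Date* columns, then count the True values per row
df["Count"] = df.filter(like="Date").gt(cutoff).sum(axis=1)
print(df[["ID", "Count"]])
```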

Get consecutive occurrences of an event by group in pandas

I'm working with a DataFrame that has id, wage and date, like this:
id wage date
1 100 201212
1 100 201301
1 0 201302
1 0 201303
1 120 201304
1 0 201305
.
2 0 201302
2 0 201303
And I want to create a n_months_no_income column that counts how many consecutive months a given individual has got wage==0, like this:
id wage date n_months_no_income
1 100 201212 0
1 100 201301 0
1 0 201302 1
1 0 201303 2
1 120 201304 0
1 0 201305 1
. .
2 0 201302 1
2 0 201303 2
I feel it's some sort of mix between groupby('id') , cumcount(), maybe diff() or apply() and then a fillna(0), but I'm not finding the right one.
Do you have any ideas?
Here's an example for the dataframe for ease of replication:
df = pd.DataFrame({'id':[1,1,1,1,1,1,2,2],'wage':[100,100,0,0,120,0,0,0],
'date':[201212,201301,201302,201303,201304,201305,201302,201303]})
Edit: Added code for ease of use.
In your case, use two groupby calls with cumcount, and create the additional grouping key with cumsum:
df.groupby('id').wage.apply(lambda x : x.groupby(x.ne(0).cumsum()).cumcount())
Out[333]:
0 0
1 0
2 1
3 2
4 0
5 1
Name: wage, dtype: int64
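One caveat with the cumcount approach: cumcount starts at 0 inside each block, so an id whose very first row already has wage 0 (id 2 in the sample) would apparently get 0, 1 rather than the expected 1, 2. A possible alternative, sketched here with made-up helper names is_zero and block, is to take a cumulative sum of the "wage is zero" flag within each block:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 1, 1, 2, 2],
    "wage": [100, 100, 0, 0, 120, 0, 0, 0],
    "date": [201212, 201301, 201302, 201303, 201304, 201305, 201302, 201303],
})

# True where there was no income in that month
is_zero = df["wage"].eq(0)

# A new block starts at every non-zero wage; zero rows inherit the block id
block = (~is_zero).cumsum()

# Within each (id, block), the running sum of the flag is the streak length
df["n_months_no_income"] = (
    is_zero.astype(int).groupby([df["id"], block]).cumsum()
)
print(df)
```

This assumes the rows are already sorted by id and date, as in the sample.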

Create a dataframe based on column values of another dataframe

I have a dataframe of shape 20000 x 50. Two of the columns are Date and Time (represented as the hour). The remaining columns hold observations of some parameters at that time. What I am trying to achieve is to create a new dataframe that averages all the remaining column values over every 3 hours per day and adds an ID column for this, with numbers from 1 to 8, each representing a 3-hour range.
I have enclosed an image of the source and what should be result. Any help is very much appreciated.
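Group by the Date column and a 3-hour bin derived from Hour: subtract 1 (sub), floor-divide by 3 (floordiv) and add 1 (add), then aggregate with mean. With Hour running from 1 to 24, this yields the IDs 1 to 8: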
df['Hour'] = df['Hour'].sub(1).floordiv(3).add(1)
df = df.groupby(['Date', 'Hour'], as_index=False).mean()
print (df)
Date Hour col1 col2 col3
0 05/01/2018 1 5.333333 5.333333 7.666667
1 05/01/2018 2 6.000000 6.000000 4.000000
2 06/01/2018 1 4.000000 6.333333 7.000000
3 06/01/2018 3 6.000000 6.000000 3.666667
Detail:
print (df['Hour'].sub(1).floordiv(3).add(1))
0 1
1 1
2 1
3 2
4 1
5 1
6 1
7 3
8 3
9 3
Name: Hour, dtype: int64
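To see how the arithmetic maps hours to bins, here is a small sketch on a made-up Series covering every hour of a day (assuming hours are numbered 1 to 24):

```python
import pandas as pd

# Hypothetical input: one row per hour of the day, numbered 1..24
hours = pd.Series(range(1, 25), name="Hour")

# (hour - 1) // 3 + 1 puts hours 1-3 in bin 1, 4-6 in bin 2, ..., 22-24 in bin 8
bins = hours.sub(1).floordiv(3).add(1)

print(pd.DataFrame({"Hour": hours, "ID": bins}))
```

If the hours were numbered 0 to 23 instead, the sub(1) step would be dropped.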

Pandas: count some values in a column

I have a dataframe; this is part of it:
ID,"url","app_name","used_at","active_seconds","device_connection","device_os","device_type","device_usage"
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:29:11,13,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-05-01 09:33:00,3,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:33:07,1,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Phone,2015-06-01 09:34:30,5,unknown,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-06-01 09:36:22,133,3g,android,smartphone,home
e990fae0f48b7daf52619b5ccbec61bc,"",Messaging,2015-05-02 09:38:40,5,3g,android,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Yandex.Navigator,2015-05-01 11:04:48,70,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-6-01 12:02:27,248,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Viber,2015-07-01 12:06:35,7,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",VK Client,2015-08-01 12:23:26,86,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-02 12:24:52,0,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",My Talking Angela,2015-08-03 12:24:52,167,3g,ios,smartphone,home
574c4969b017ae6481db9a7c77328bc3,"",Talking Angela,2015-08-04 12:27:39,34,3g,ios,smartphone,home
I need to count the number of days in every month for every ID.
If I try df.groupby('ID')['used_at'].count() I get the number of visits; how can I count the days per month instead?
I think you need to group by ID, month and day, and aggregate with size:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.month,df.used_at.dt.day ]).size()
print (df1)
ID used_at used_at
574c4969b017ae6481db9a7c77328bc3 5 1 1
6 1 1
7 1 1
8 1 1
2 1
3 1
4 1
e990fae0f48b7daf52619b5ccbec61bc 5 1 2
2 1
6 1 3
dtype: int64
Or by date - it is same as by year, month and day:
df1 = df.used_at.groupby([df['ID'], df.used_at.dt.date]).size()
print (df1)
ID used_at
574c4969b017ae6481db9a7c77328bc3 2015-05-01 1
2015-06-01 1
2015-07-01 1
2015-08-01 1
2015-08-02 1
2015-08-03 1
2015-08-04 1
e990fae0f48b7daf52619b5ccbec61bc 2015-05-01 2
2015-05-02 1
2015-06-01 3
dtype: int64
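If the goal is the number of distinct days per month for each ID (rather than the number of rows per day), one possible sketch is to group by ID and a month period and count unique days. The short IDs and the column names month and day below are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "a", "a", "b", "b", "b"],
    "used_at": pd.to_datetime([
        "2015-05-01 09:29:11", "2015-05-01 09:33:00", "2015-05-02 09:38:40",
        "2015-06-01 09:33:07",
        "2015-08-01 12:23:26", "2015-08-02 12:24:52", "2015-08-02 12:27:39",
    ]),
})

days = df.assign(
    month=df["used_at"].dt.to_period("M"),   # month-year key
    day=df["used_at"].dt.normalize(),        # timestamp truncated to midnight
)

# Number of distinct days on which each ID was active, per month
days_per_month = days.groupby(["ID", "month"])["day"].nunique()
print(days_per_month)
```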
Differences between count and size:
size counts NaN values, count does not.
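A small sketch of that difference, using a made-up frame with one missing timestamp:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["a", "a", "b"],
    "used_at": pd.to_datetime(["2015-05-01", None, "2015-06-01"]),  # None becomes NaT
})

grouped = df.groupby("ID")["used_at"]
sizes = grouped.size()    # rows per group, missing values included
counts = grouped.count()  # non-missing values per group

print(pd.DataFrame({"size": sizes, "count": counts}))
```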
