Grouping in pandas and assigning a repetition number (first, second, third) - python

I have a python pandas dataframe that looks like this:
date userid
2017-03 a
2017-04 b
2017-06 b
2017-08 b
2017-05 c
2017-08 c
I would like to create a third column that indicates the number of times that the sample was repeated up to that date, so the frame looks like this:
date userid repetition
2017-03 a 1
2017-04 b 1
2017-06 b 2
2017-08 b 3
2017-05 c 1
2017-08 c 2
So far, I grouped it by userid, but I only found a way to get the total counts:
data['newcol'] = data.groupby(['userid'])['date'].transform('count')
Thank you very much!!
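For reference, the example frame can be reconstructed like this (a sketch based on the sample shown):
import pandas as pd

df = pd.DataFrame({
    'date': ['2017-03', '2017-04', '2017-06', '2017-08', '2017-05', '2017-08'],
    'userid': ['a', 'b', 'b', 'b', 'c', 'c'],
})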

Use cumcount
In [282]: df.groupby('userid').cumcount().add(1)
Out[282]:
0 1
1 1
2 2
3 3
4 1
5 2
dtype: int64
In [283]: df.assign(repetition=df.groupby('userid').cumcount().add(1))
Out[283]:
date userid repetition
0 2017-03 a 1
1 2017-04 b 1
2 2017-06 b 2
3 2017-08 b 3
4 2017-05 c 1
5 2017-08 c 2
Or, assign the column directly:
In [285]: df['repetition'] = df.groupby('userid').cumcount().add(1)
In [286]: df
Out[286]:
date userid repetition
0 2017-03 a 1
1 2017-04 b 1
2 2017-06 b 2
3 2017-08 b 3
4 2017-05 c 1
5 2017-08 c 2
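Note that cumcount numbers rows in their current order within each group; if the frame is not already sorted by date, sort first so the repetition numbers follow chronology (a minimal sketch reusing the question's column names):
# cumcount is zero-based and follows row order; sort by date within
# each user before numbering, then add 1 to start at 1
df = df.sort_values(['userid', 'date'])
df['repetition'] = df.groupby('userid').cumcount().add(1)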

Related

merge to replace nans of the same column in pandas dataframe?

I have the following dataframe, to which I want to merge multiple dataframes; this df consists of ID, date, and many other variables..
ID date ..other variables...
A 2017Q1
A 2017Q2
A 2018Q1
B 2017Q1
B 2017Q2
B 2017Q3
C 2018Q1
C 2018Q2
.. ..
And I have a bunch of dataframes (by quarter) that have asset holdings information
df_2017Q1:
ID date asset_holdings
A 2017Q1 1
B 2017Q1 2
C 2017Q1 4
...
df_2017Q2
ID date asset_holdings
A 2017Q2 2
B 2017Q2 5
C 2017Q2 4
...
df_2017Q3
ID date asset_holdings
A 2017Q3 1
B 2017Q3 2
C 2017Q3 10
...
df_2017Q4..
ID date asset_holdings
A 2017Q4 10
B 2017Q4 20
C 2017Q4 14
...
df_2018Q1..
ID date asset_holdings
A 2018Q1 11
B 2018Q1 23
C 2018Q1 15
...
df_2018Q2...
ID date asset_holdings
A 2018Q2 11
B 2018Q2 26
C 2018Q2 19
...
....
desired output
ID date asset_holdings ..other variables...
A 2017Q1 1
A 2017Q2 2
A 2018Q1 11
B 2017Q1 2
B 2017Q2 5
B 2017Q3 2
C 2018Q1 15
C 2018Q2 19
.. ..
I think merging on ID and date should do it, but doing one merge per quarterly frame would create n extra columns, which I do not want. So I want to create a single column "asset_holdings" and merge the right dfs into it while updating the NaN values. But I'm not sure if this is the smartest way. Any help will be appreciated!
Try to use pd.concat() to concatenate your different DataFrames and then use sort_values(['ID', 'date']) to sort the values by the columns ID and date.
See the example below as a demonstration.
import pandas as pd
df1 = pd.DataFrame({'ID':list('ABCD'), 'date':['2017Q1']*4, 'other':[1,2,3,4]})
df2 = pd.DataFrame({'ID':list('ABCD'), 'date':['2017Q2']*4, 'other':[4,3,2,1]})
df3 = pd.DataFrame({'ID':list('ABCD'), 'date':['2018Q1']*4, 'other':[7,6,5,4]})
ans = pd.concat([df1, df2, df3]).sort_values(['ID', 'date'], ignore_index=True)
>>> ans
ID date other
0 A 2017Q1 1
1 A 2017Q2 4
2 A 2018Q1 7
3 B 2017Q1 2
4 B 2017Q2 3
5 B 2018Q1 6
6 C 2017Q1 3
7 C 2017Q2 2
8 C 2018Q1 5
9 D 2017Q1 4
10 D 2017Q2 1
11 D 2018Q1 4
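To then attach the holdings to the original frame with the other variables, it may be simplest to concat all the quarterly frames once and do a single merge, which adds exactly one asset_holdings column instead of one per quarter (a sketch, reusing the frame names from the question):
# stack the quarterly holdings frames into one long frame
holdings = pd.concat([df_2017Q1, df_2017Q2, df_2017Q3,
                      df_2017Q4, df_2018Q1, df_2018Q2],
                     ignore_index=True)
# a single left merge on both keys fills values where they exist
# and leaves NaN elsewhere, with no duplicated columns
out = df.merge(holdings, on=['ID', 'date'], how='left')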

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know, per row, the number of date columns after 2016-12-31. Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column:
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a dataframe with the column named Count:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to select the Date* columns, then .sum(axis=1):
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
This selects all columns with 'Date' in the name, which is convenient when there are many such columns and you don't want to list them one by one. The values are then compared with the cutoff date and the True values are summed.
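All of these solutions assume the Date columns already hold datetimes; if they were read in as strings, comparing them against a pd.Timestamp raises a TypeError, so convert first (a minimal sketch, reusing the filter trick from above):
date_cols = df.filter(like='Date').columns
# parse each Date column to datetime64 so the comparisons are safe
df[date_cols] = df[date_cols].apply(pd.to_datetime)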

replace values in dataframe based on other dataframe filter

I have 2 DataFrames, and I want to replace the values in one dataframe with the values of the other, based on the first frame's column labels that appear in the second. I lay the frames out below to clarify.
DF1:
A B C D E
Date
01/01/2019 1 2 3 4 5
02/01/2019 1 2 3 4 5
03/01/2019 1 2 3 4 5
DF2:
name1 name2 name3
Date
01/01/2019 A B D
02/01/2019 B C E
03/01/2019 A D E
THE RESULT I WANT:
name1 name2 name3
Date
01/01/2019 1 2 4
02/01/2019 2 3 5
03/01/2019 1 4 5
Try (this assumes reset_index() has been called on both frames first, so the dates live in a column named index):
result = (
    df2.melt(id_vars="index")
    .merge(
        df1.melt(id_vars="index"),
        left_on=["index", "value"],
        right_on=["index", "variable"],
    )
    .drop(columns=["value_x", "variable_y"])
    .pivot(index="index", columns="variable_x", values="value_y")
)
print(result)
The two melts transform your dataframes so each holds its values in a single column, with an additional column for the original column names:
df1.melt(id_vars='index')
index variable value
0 01/01/2019 A 1
1 02/01/2019 A 1
2 03/01/2019 A 1
3 01/01/2019 B 2
4 02/01/2019 B 2
5 03/01/2019 B 2
...
These you can now join on index and value/variable. The last part is just removing a couple of columns and then reshaping the table back to the desired form.
The result is
variable_x name1 name2 name3
index
01/01/2019 1 2 4
02/01/2019 2 3 5
03/01/2019 1 4 5
Use DataFrame.lookup for each column separately:
for c in df2.columns:
    df2[c] = df1.lookup(df1.index, df2[c])
print (df2)
name1 name2 name3
01/01/2019 1 2 4
02/01/2019 2 3 5
03/01/2019 1 4 5
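Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions the same per-row lookup can be done with plain numpy indexing; a sketch, assuming df1 and df2 share the same index in the same order (the loop in the general solution below can be rewritten the same way):
import numpy as np

rows = np.arange(len(df2))
for c in df2.columns:
    # map each label in df2[c] to its column position in df1,
    # then take the value at each (row, column) pair
    df2[c] = df1.to_numpy()[rows, df1.columns.get_indexer(df2[c])]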
A general solution is possible for different index and column names:
print (df1)
A B C D G
01/01/2019 1 2 3 4 5
02/01/2019 1 2 3 4 5
05/01/2019 1 2 3 4 5
print (df2)
name1 name2 name3
01/01/2019 A B D
02/01/2019 B C E
08/01/2019 A D E
df1.index = pd.to_datetime(df1.index, dayfirst=True)
df2.index = pd.to_datetime(df2.index, dayfirst=True)
cols = df2.stack().unique()
idx = df2.index
df11 = df1.reindex(columns=cols, index=idx)
print (df11)
A B D C E
2019-01-01 1.0 2.0 4.0 3.0 NaN
2019-01-02 1.0 2.0 4.0 3.0 NaN
2019-01-08 NaN NaN NaN NaN NaN
for c in df2.columns:
    df2[c] = df11.lookup(df11.index, df2[c])
print (df2)
name1 name2 name3
2019-01-01 1.0 2.0 4.0
2019-01-02 2.0 3.0 NaN
2019-01-08 NaN NaN NaN

How to calculate time difference by group using pandas?

Problem
I want to calculate diff by group, and I don't know how to sort the time column so that each group's results are sorted and positive.
The original data :
In [37]: df
Out[37]:
id time
0 A 2016-11-25 16:32:17
1 A 2016-11-25 16:36:04
2 A 2016-11-25 16:35:29
3 B 2016-11-25 16:35:24
4 B 2016-11-25 16:35:46
The result I want
Out[40]:
id time
0 A 00:35
1 A 03:12
2 B 00:22
Notice: the type of the time column in the result is timedelta64[ns]
What I tried
In [38]: df['time'].diff(1)
Out[38]:
0 NaT
1 00:03:47
2 -1 days +23:59:25
3 -1 days +23:59:55
4 00:00:22
Name: time, dtype: timedelta64[ns]
This doesn't give the desired result.
Hope
The code should not only solve the problem but also run fast, because there are 50 million rows.
You can use sort_values with groupby and aggregating diff:
df['diff'] = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time diff
0 A 2016-11-25 16:32:17 NaT
1 A 2016-11-25 16:36:04 00:00:35
2 A 2016-11-25 16:35:29 00:03:12
3 B 2016-11-25 16:35:24 NaT
4 B 2016-11-25 16:35:46 00:00:22
If you need to remove rows with NaT in the diff column, use dropna:
df = df.dropna(subset=['diff'])
print (df)
id time diff
2 A 2016-11-25 16:35:29 00:03:12
1 A 2016-11-25 16:36:04 00:00:35
4 B 2016-11-25 16:35:46 00:00:22
You can also overwrite the time column:
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
print (df)
id time
0 A NaT
1 A 00:00:35
2 A 00:03:12
3 B NaT
4 B 00:00:22
df.time = df.sort_values(['id','time']).groupby('id')['time'].diff()
df = df.dropna(subset=['time'])
print (df)
id time
1 A 00:00:35
2 A 00:03:12
4 B 00:00:22
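If the time column was read in as strings, parse it once up front; the sort and the vectorized diff are then fast even at 50 million rows (a minimal sketch):
# parse once; diff on a datetime64[ns] column is vectorized
df['time'] = pd.to_datetime(df['time'])
df['diff'] = df.sort_values(['id', 'time']).groupby('id')['time'].diff()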

Generating sub data frames based on a value in a column

I have the following data frame in pandas. Now I want to generate a sub data frame whenever I see certain values in the Activity column. So for example, I want a data frame with all the data for Name A if the Activity column contains the value 3 or 5.
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
C 01-31-2015 1
C 01-31-2015 2
C 01-31-2015 2
So for the above data, I want to get
df_A as
Name Date Activity
A 01-02-2015 1
A 01-03-2015 2
A 01-04-2015 3
A 01-04-2015 1
df_B as
B 01-02-2015 1
B 01-02-2015 2
B 01-03-2015 1
B 01-04-2015 5
Since Name C does not have 3 or 5 in the column Activity, I do not want to get this data frame.
Also, the names in the data frame can vary with each input file.
Once I have these data frames separated, I want to plot a time series.
You can group the dataframe by column Name, apply a custom function f, and then select dataframes df_A and df_B:
print(df)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
8 C 2015-01-31 1
9 C 2015-01-31 2
10 C 2015-01-31 2
def f(df):
    if ((df['Activity'] == 3) | (df['Activity'] == 5)).any():
        return df

g = df.groupby('Name').apply(f).reset_index(drop=True)
df_A = g.loc[g.Name == 'A']
print(df_A)
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
df_B = g.loc[g.Name == 'B']
print(df_B)
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
df_A.plot()
df_B.plot()
In the end you can use plot (see the pandas visualization docs for more info).
EDIT:
If you want to create the dataframes dynamically, you can find all unique values of column Name with drop_duplicates:
for name in g.Name.drop_duplicates():
    print(g.loc[g.Name == name])
Name Date Activity
0 A 2015-01-02 1
1 A 2015-01-03 2
2 A 2015-01-04 3
3 A 2015-01-04 1
Name Date Activity
4 B 2015-01-02 1
5 B 2015-01-02 2
6 B 2015-01-03 1
7 B 2015-01-04 5
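A more idiomatic alternative to applying a function that returns df or None is groupby.filter, which keeps only the groups satisfying a condition (a sketch on the same data):
# keep every Name group whose Activity column contains a 3 or a 5
g = df.groupby('Name').filter(lambda x: x['Activity'].isin([3, 5]).any())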
You can use a dictionary comprehension to create a sub dataframe for each Name with an Activity value of 3 or 5.
active_names = df[df.Activity.isin([3, 5])].Name.unique().tolist()
dfs = {name: df.loc[df.Name == name, :] for name in active_names}
>>> dfs['A']
Name Date Activity
0 A 01-02-2015 1
1 A 01-03-2015 2
2 A 01-04-2015 3
3 A 01-04-2015 1
>>> dfs['B']
Name Date Activity
4 B 01-02-2015 1
5 B 01-02-2015 2
6 B 01-03-2015 1
7 B 01-04-2015 5
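For the time-series plots mentioned in the question, a possible follow-up is to loop over the dictionary (a sketch; the %m-%d-%Y date format is an assumption read off the sample data):
import matplotlib.pyplot as plt
import pandas as pd

for name, sub in dfs.items():
    # parse the Date strings so the x-axis is chronological
    sub = sub.assign(Date=pd.to_datetime(sub['Date'], format='%m-%d-%Y'))
    sub.plot(x='Date', y='Activity', title=name)
plt.show()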
