Create new column (pandas dataframe) when duplicate ids have a payment date - python

I have a pandas dataframe:
pd.DataFrame({'id': [1, 1, 2, 2, 3, 3],
              'payment_count': [1, 2, 1, 2, 1, 2],
              'payment_date': ['2/2/2020', '4/6/2020', '3/20/2020', '3/29/2020', '5/1/2020', '5/30/2020']})
I want to take max('payment_count') by each 'id' and create a new column with the associated 'payment_date'. Desired output:
pd.DataFrame({'id': [1, 2, 3],
              'payment_date_1': ['2/2/2020', '3/20/2020', '5/1/2020'],
              'payment_date_2': ['4/6/2020', '3/29/2020', '5/30/2020']})

You can try with pivot, rename_axis, add_prefix and reset_index:
df.pivot(index='id', columns='payment_count', values='payment_date')\
.rename_axis(None, axis=1)\
.add_prefix('payment_date_')\
.reset_index()
Output:
id payment_date_1 payment_date_2
0 1 2/2/2020 4/6/2020
1 2 3/20/2020 3/29/2020
2 3 5/1/2020 5/30/2020
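Note that pivot requires each (id, payment_count) pair to be unique. If duplicates are possible, a tolerant variant (a sketch, not part of the original answer) is pivot_table with aggfunc='first':
# sketch: pivot_table tolerates duplicate (id, payment_count) pairs,
# keeping the first date seen for each pair
df.pivot_table(index='id', columns='payment_count', values='payment_date',
               aggfunc='first')\
.rename_axis(None, axis=1)\
.add_prefix('payment_date_')\
.reset_index()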

Another way using groupby.
df['paydate'] = df.groupby('id')['payment_date'].cumcount()+1
df['paydate'] = 'payment_date_' + df['paydate'].astype(str)
df = df.set_index(['paydate','id'])['payment_date']
df = df.unstack(0).rename_axis(None)
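If you want the same flat layout as the desired output (id as a column), a small variation (sketch) drops the columns label instead of the index name:
# sketch: clear the 'paydate' columns label and restore 'id' as a column
df = df.unstack(0).rename_axis(None, axis=1).reset_index()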

Ugly but it does what you asked. pivot sounds better though.
groups = df.groupby('id')
records = []
for key, group in groups:
    # order each group's dates by payment_count, then number them
    order = group['payment_count'].to_numpy().argsort()
    payments = {f'payment_date_{i + 1}': date
                for i, date in enumerate(group['payment_date'].iloc[order])}
    payments['id'] = key
    records.append(payments)
_df = pd.DataFrame(records)

Related

pandas groupby ID and select row with minimal value of specific columns

I want to select the whole row in which the minimal value of 3 selected columns is found, in a dataframe like this:
It is supposed to look like this afterwards:
I tried something like
dfcheckminrow = dfquery[dfquery == dfquery['A':'C'].min().groupby('ID')]
obviously it didn't work out well.
Thanks in advance!
Bkeesey's answer looks like it almost got you to your solution. I added one more step to get the overall minimum for each group.
import pandas as pd
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6],
                   })
# set "ID" as the index
df = df.set_index('ID')
# get the group minimum for each column
mindf = df[['A','B']].groupby('ID').transform('min')
# get the min between columns and add it to df
df['min'] = mindf.min(axis=1)
# filter df for when A or B matches the min
df2 = df.loc[(df['A'] == df['min']) | (df['B'] == df['min'])]
print(df2)
In my simplified example, I'm just finding the minimum between columns A and B. Here's the output:
A B C min
ID
1 14 1 2 1
2 100 2 3 2
3 1 100 5 1
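An equivalent, slightly shorter route (a sketch of the same idea): take the row-wise minimum first, then compare it against each group's minimum of that value.
# sketch: row-wise min once, then keep rows matching their group's minimum
df['min'] = df[['A', 'B']].min(axis=1)
df2 = df.loc[df['min'] == df.groupby('ID')['min'].transform('min')]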
One method to filter the initial DataFrame based on a groupby conditional could be to use transform to find the minimum for an "ID" group and then use loc to filter the initial DataFrame where `any(axis=1)` (checking rows) is met.
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})
# set "ID" as the index
df = df.set_index('ID')
Sample df:
A B
ID
1 30 10
1 14 1
2 100 2
2 67 5
3 1 100
3 20 3
Use groupby and transform to find the minimum value for each "ID" group.
Then use loc to filter the initial df to where any(axis=1) is valid:
df.loc[(df == df.groupby('ID').transform('min')).any(axis=1)]
Output:
A B
ID
1 14 1
2 100 2
2 67 5
3 1 100
3 20 3
In this example only the first row is removed, because neither of its values is the minimum of its column within its "ID" group.

Struggling to understand groupby pandas

I'm struggling to understand how the parameters for df.groupby works. I have the following code:
df = pd.read_sql(query, cnxn)
codegroup = df.groupby(['CODE'])
I then attempt a for loop as follows:
for code in codegroup:
    dfsize = codegroup.size()
    dfmax = codegroup['ID'].max()
    dfmin = codegroup['ID'].min()
    result = ((dfmax - dfmin) - dfsize)
    if result == 1:
        df2 = df2.append(itn)
    else:
        df3 = df3.append(itn)
I'm trying to iterate over each unique code. Does the for loop understand that I'm trying to loop through each code based on the above? Thank you in advance.
Pandas groupby returns an iterable that yields tuples of the group key and the group DataFrame. You can perform your max and min operations on each group as:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1], 'b': [3, 4, 5, 6, 7, 8]})
In [3]: for k, g in df.groupby('a'):
...: print(g['b'].max())
...:
5
8
You can also get the min and max directly as a DataFrame using agg:
In [4]: df.groupby('a')['b'].agg(['max', 'min'])
Out[4]:
max min
a
0 5 3
1 8 6
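Applied to the check in the question, a loop-free sketch (assuming the intent is to route whole CODE groups by whether (max - min) - size equals 1; 'CODE' and 'ID' are the question's column names):
g = df.groupby('CODE')['ID']
result = (g.max() - g.min()) - g.size()   # one value per CODE
passing = result[result == 1].index       # codes where the check holds
df2 = df[df['CODE'].isin(passing)]
df3 = df[~df['CODE'].isin(passing)]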

extract column information from a single df and input to multiple dfs where identifier needs remapping

I need to append the row data from a column in df1 into separate dfs.
The row value from column 'i1' in df1 should correspond to the name of the dataframe that it needs appending to, and there is a common id column across the dataframes.
However, the i1 name and the names of the tables are different. I have created a dictionary below so you can see what I mean.
d_map = {'ab1': 'c30_sab1',
         'cd2': 'kjm_1cd2'}
Example data and the expected output are shown below with df1. Any pointers would be great. Thanks so much.
df1
df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3], 'i1': ['ab1','cd2','ab1','cd2','ab1'], 'i2': ['10:25','10:27','11:51','12:01','13:18']})
Tables that need appending with the i2 column from df1, depending on the id and i1 match:
c30_sab = pd.DataFrame(data={'id': [1, 2, 3]})
kjm_1cd = pd.DataFrame(data={'id': [1, 2]})
expected output
e_ab1 = pd.DataFrame(data={'id': [1, 2, 3], 'i2': ['10:25','11:51','13:18']})
e_cd2 = pd.DataFrame(data={'id': [1, 2], 'i2': ['10:27','12:01']})
A simple way to do it (assuming you accept repetitions when the df ids are duplicated):
df_ab1 = df[df['i1'] == 'ab1']  # select only the rows for the 'ab1' table
df_cd2 = df[df['i1'] == 'cd2']  # select only the rows for the 'cd2' table
e_ab1 = c30_sab.merge(df_ab1[['id', 'i2']], on='id')
e_cd2 = kjm_1cd.merge(df_cd2[['id', 'i2']], on='id')
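To avoid hardcoding each table, the merge can be driven by d_map. A sketch, assuming the target frames are collected in a dict keyed by the names d_map points at (the tables dict below is illustrative):
tables = {'c30_sab1': c30_sab, 'kjm_1cd2': kjm_1cd}
outputs = {}
for i1_value, table_name in d_map.items():
    subset = df.loc[df['i1'] == i1_value, ['id', 'i2']]
    outputs[table_name] = tables[table_name].merge(subset, on='id')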

Average by value duplicated pandas python

I have the following csv and I need to get the duplicated values from the DialedNumber column and then the average Duration of those duplicates.
I already got the duplicates with the following code:
df = pd.read_csv('cdrs.csv')
dnidump = pd.DataFrame(df, columns=['DialedNumber'])
pd.options.display.float_format = '{:.0f}'.format
dupl_dni = dnidump.pivot_table(index=['DialedNumber'], aggfunc='size')
a1 = dupl_dni.to_frame().rename(columns={0:'TimesRepeated'}).sort_values(by=['TimesRepeated'], ascending=False)
b = a1.head(10)
print(b)
Output:
DialedNumber TimesRepeated
50947740194 4
50936564292 2
50931473242 3
I can't figure out how to get the average duration of those duplicates, any ideas? Thanks!
try:
df_mean = df.groupby('DialedNumber')['Duration'].mean()
Use df.groupby('column').mean()
Here is sample code.
Input
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [2461, 1023, 9, 5614, 212],
                   'C': [2, 4, 8, 16, 32]}, columns=['A', 'B', 'C'])
df.groupby('A').mean()
Output
B C
A
1 1164.333333 4.666667
2 2913.000000 24.000000
API reference of pandas.core.groupby.GroupBy.mean
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
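If you want the repeat count and the average together, a one-pass sketch (assuming the csv has a numeric 'Duration' column, as the question implies):
stats = (df.groupby('DialedNumber')
           .agg(TimesRepeated=('Duration', 'size'),
                AvgDuration=('Duration', 'mean'))
           .sort_values('TimesRepeated', ascending=False))
print(stats.head(10))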

Faster Way to GroupBy Apply Python Pandas?

How can I make the Groupby Apply run faster, or how can I write it a different way?
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'value': [1, 2, np.nan, 3, np.nan, 1, 2, np.nan, 4, np.nan]})
result = df.groupby("ID").apply(lambda x: len(x[x['value'].notnull()].index)\
if((len(x[x['value']==1].index)>=1)&\
(len(x[x['value']==4].index)==0)) else 0)
output:
ID
1    3
2    0
My program runs very slow right now. Can I make it faster? I have in the past filtered before using groupby() but I don't see an easy way to do it in this situation.
Not sure if this is what you need. I have decomposed it a bit, but you can easily method-chain it to get the code more compact:
df = pd.DataFrame(
    {
        "ID": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "value": [1, 2, np.nan, 3, np.nan, 1, 2, np.nan, 4, np.nan],
    }
)
df["x1"] = df["value"] == 1
df["x2"] = df["value"] == 4
df2 = df.groupby("ID").agg(
y1=pd.NamedAgg(column="x1", aggfunc="max"),
y2=pd.NamedAgg(column="x2", aggfunc="max"),
cnt=pd.NamedAgg(column="value", aggfunc="count"),
)
df3 = df2.assign(z=lambda x: (x['y1'] & ~x['y2'])*x['cnt'])
result = df3.drop(columns=['y1', 'y2', 'cnt'])
print(result)
which will yield
z
ID
1 3
2 0
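For reference, the same logic method-chained as suggested above (a sketch equivalent to the decomposed steps):
result = (
    df.assign(x1=df['value'].eq(1), x2=df['value'].eq(4))
      .groupby('ID')
      .agg(y1=('x1', 'max'), y2=('x2', 'max'), cnt=('value', 'count'))
      .assign(z=lambda d: (d['y1'] & ~d['y2']) * d['cnt'])
      .loc[:, ['z']]
)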
