How to enumerate rows in pandas with non-unique values in groups - Python

I am working with expedition geodata. Could you help with enumerating stations and the records within each station, based on expedition ID (ID), date (Date), latitude (Lat) and longitude (Lon)? The value column (Val) is not relevant for the enumeration. Assume that a station is a group of rows with the same (ID, Date, Lat, Lon) and an expedition is a group of rows with the same ID.
The dataframe is sorted by these 4 columns, as in the example below.
Dataset and required columns
import pandas as pd

data = [[1, '2017/10/10', 70.1, 30.4, 10],
        [1, '2017/10/10', 70.1, 31.4, 20],
        [1, '2017/10/10', 70.1, 31.4, 10],
        [1, '2017/10/10', 70.1, 31.4, 10],
        [1, '2017/10/12', 70.1, 31.4, 20],
        [2, '2017/12/10', 70.1, 30.4, 20],
        [2, '2017/12/10', 70.1, 31.4, 20]]
df = pd.DataFrame(data, columns=['ID', 'Date', 'Lat', 'Lon', 'Val'])
Additionally, I need a column St for the station number and a column Rec for the record number within the same station's data; the expected output for the example above is:
df['St'] = [1, 2, 2, 2, 3, 1, 2]
df['Rec'] = [1, 1, 2, 3, 1, 1, 1]
print(df)
I have tried groupby/cumcount/agg/factorize but have not been able to solve this.
Any help is appreciated, thanks!

To create 'St', you can groupby on 'ID', use shift to flag rows where any of the columns 'Date', 'Lat', 'Lon' differs from the previous row, and take the cumsum of those flags, such as:
df['St'] = (df.groupby(['ID'])
              .apply(lambda x: (x[['Date', 'Lat', 'Lon']].shift() != x[['Date', 'Lat', 'Lon']])
                               .any(axis=1).cumsum())).values
And to create 'Rec', you also need groupby, but on all four columns 'ID', 'Date', 'Lat', 'Lon', and then use cumcount and add 1, such as:
df['Rec'] = df.groupby(['ID','Date','Lat','Lon']).cumcount().add(1)
and you get:
   ID        Date   Lat   Lon  Val  St  Rec
0   1  2017/10/10  70.1  30.4   10   1    1
1   1  2017/10/10  70.1  31.4   20   2    1
2   1  2017/10/10  70.1  31.4   10   2    2
3   1  2017/10/10  70.1  31.4   10   2    3
4   1  2017/10/12  70.1  31.4   20   3    1
5   2  2017/12/10  70.1  30.4   20   1    1
6   2  2017/12/10  70.1  31.4   20   2    1
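If you prefer to avoid groupby/apply for 'St', a minimal alternative sketch (assuming the frame is sorted by the four key columns, as stated in the question) flags key changes once and takes a cumulative sum per expedition; the extra ID comparison forces a new station at the start of each expedition:
station_key = ['Date', 'Lat', 'Lon']
# True whenever the station key or the expedition ID changes from the previous row
change = (df[station_key].ne(df[station_key].shift()).any(axis=1)
          | df['ID'].ne(df['ID'].shift()))
# cumulative count of changes within each expedition gives the station number
df['St'] = change.astype(int).groupby(df['ID']).cumsum()
This should produce the same 'St' values as the groupby/apply version above.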

Related

Append column from one dataframe to another for rows that match in both dataframes

I have two dataframes A and B that contain different sets of patient data, and I need to append certain columns from B to A, but only for rows that contain information from the same patient and visit, i.e. where A and B have matching values in two particular columns. B is longer than A, and not all rows in A are contained in B. I don't know how to do this without looping, but looping over pandas dataframes is generally discouraged (apart from the fact that my loop solution does not work because of "Can only compare identically-labeled Series objects"). I read the options here
How to iterate over rows in a DataFrame in Pandas
but don't see which one I could apply here and would appreciate any tips!
Toy example (the actual dataframe has about 300 rows):
dict_A = {
    'ID': ['A_190792', 'X_210392', 'B_050490', 'F_311291', 'K_010989'],
    'Visit_Date': ['2010-10-31', '2011-09-24', '2010-30-01', '2012-01-01', '2013-08-13'],
    'Score': [30, 23, 42, 23, 31],
}
A = pd.DataFrame(dict_A)

dict_B = {
    'ID': ['G_090891', 'A_190792', 'Z_060791', 'X_210392', 'B_050490', 'F_311291', 'K_010989', 'F_230989'],
    'Visit_Date': ['2013-03-01', '2010-10-31', '2013-04-03', '2011-09-24', '2010-30-01', '2012-01-01', '2013-08-13', '2014-09-09'],
    'Diagnosis': ['F12', 'G42', 'F34', 'F90', 'G98', 'G87', 'F23', 'G42'],
}
B = pd.DataFrame(dict_B)
for idx, row in A.iterrows():
    A.loc[row, 'Diagnosis'] = B['Diagnosis'][(B['Visit_Date'] == A['Visit_Date']) & (B['ID'] == A['ID'])]
    # Appends Diagnosis column from B to A for rows where ID and date match
I have seen this question Append Columns to Dataframe 1 Based on Matching Column Values in Dataframe 2, but the only answer is quite specific to that case and also does not address whether a loop can/should be used or not.
I think you can use merge:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='outer')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30.0 G42
1 X_210392 2011-09-24 23.0 F90
2 B_050490 2010-30-01 42.0 G98
3 F_311291 2012-01-01 23.0 G87
4 K_010989 2013-08-13 31.0 F23
5 G_090891 2013-03-01 NaN F12
6 Z_060791 2013-04-03 NaN F34
7 F_230989 2014-09-09 NaN G42
'''
If you want to keep only the rows of A:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='left')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30 G42
1 X_210392 2011-09-24 23 F90
2 B_050490 2010-30-01 42 G98
3 F_311291 2012-01-01 23 G87
4 K_010989 2013-08-13 31 F23
'''
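If you only need to bring over the Diagnosis column rather than every column of B, a minimal sketch under the same assumptions restricts B to the key columns plus the ones you want before merging:
final = A.merge(B[['ID', 'Visit_Date', 'Diagnosis']],
                on=['ID', 'Visit_Date'], how='left')
This keeps exactly the rows of A and avoids the loop entirely.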

Sum two columns only if the value of one column is greater than 0

I've got the following dataframe
lst=[['01012021','',100],['01012021','','50'],['01022021',140,5],['01022021',160,12],['01032021','',20],['01032021',200,25]]
df1=pd.DataFrame(lst,columns=['Date','AuM','NNA'])
I am looking for code that sums the columns AuM and NNA, but only for rows where AuM contains a value. The desired result is shown below:
lst=[['01012021','',100,''],['01012021','','50',''],['01022021',140,5,145],['01022021',160,12,172],['01032021','',20,'']]
df2=pd.DataFrame(lst,columns=['Date','AuM','NNA','Sum'])
It is not a good practice to use '' in place of NaN when you have numeric data.
That said, a generic solution to your issue would be to use sum with the skipna=False option:
df1['Sum'] = (df1[['AuM', 'NNA']]                    # you can use as many columns as you want
              .apply(pd.to_numeric, errors='coerce') # convert to numeric
              .sum(axis=1, skipna=False)             # sum only if all values are non-NaN
              .fillna('')                            # fill NaN with empty string (bad practice)
              )
output:
       Date  AuM NNA    Sum
0  01012021      100
1  01012021       50
2  01022021  140   5  145.0
3  01022021  160  12  172.0
4  01032021       20
5  01032021  200  25  225.0
I assume you mean to include the last row too:
df2 = (df1.assign(Sum=df1.loc[df1.AuM.ne(""), ["AuM", "NNA"]].sum(axis=1))
          .fillna(""))
print(df2)
Result:
       Date  AuM NNA    Sum
0  01012021      100
1  01012021       50
2  01022021  140   5  145.0
3  01022021  160  12  172.0
4  01032021       20
5  01032021  200  25  225.0
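As a side note, a minimal sketch of the NaN-based variant (assuming the empty strings can be replaced by NaN) keeps the columns numeric, and element-wise addition then propagates the missing values on its own:
# convert '' (and the string '50') to proper numbers; '' becomes NaN
df1[['AuM', 'NNA']] = df1[['AuM', 'NNA']].apply(pd.to_numeric, errors='coerce')
# NaN in AuM propagates into Sum automatically, no skipna handling needed
df1['Sum'] = df1['AuM'] + df1['NNA']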

Output raw value difference from one period to the next using Python

I have a dataset, df, where I have a new value for each day. I would like to output the percent difference of these values from row to row as well as the raw value difference:
Date Value
10/01/2020 1
10/02/2020 2
10/03/2020 5
10/04/2020 8
Desired output:
Date Value PercentDifference ValueDifference
10/01/2020 1
10/02/2020 2 100 2
10/03/2020 5 150 3
10/04/2020 8 60 3
This is what I am doing:
import pandas as pd

df = pd.read_csv('df.csv')
df = (df.merge(df.assign(Date=df['Date'] - pd.to_timedelta('1D')),
               on='Date')
        .assign(Value=lambda x: x['Value_y'] - x['Value_x'])
        [['Date', 'Value']]
      )
df['PercentDifference'] = [f'{x:.2%}' for x in
                           (df['Value'].div(df['Value'].shift(1)) - 1).fillna(0)]
A member has helped me with the code above; I am also trying to incorporate the value difference, as shown in my desired output.
Note: is there a way to incorporate a 'period', e.g. checking the percent difference and value difference over a 7-day period, a 30-day period, and so on?
Any suggestion is appreciated.
Use Series.pct_change and Series.diff
df['PercentageDiff'] = df['Value'].pct_change().mul(100)
df['ValueDiff'] = df['Value'].diff()
Date Value PercentageDiff ValueDiff
0 10/01/2020 1 NaN NaN
1 10/02/2020 2 100.0 1.0
2 10/03/2020 5 150.0 3.0
3 10/04/2020 8 60.0 3.0
Or you can use df.assign:
df.assign(
    percentageDiff=df["Value"].pct_change().mul(100),
    ValueDiff=df["Value"].diff()
)
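On the 'period' note in the question: both pct_change and diff take a periods argument, so a rough sketch of a 7-day comparison (assuming exactly one row per calendar day) could be:
df['PercentDiff_7d'] = df['Value'].pct_change(periods=7).mul(100)
df['ValueDiff_7d'] = df['Value'].diff(periods=7)
If the dates have gaps, set a DatetimeIndex first and use shift(freq='7D') or resample, so the lag is based on calendar time rather than row count.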

Groupby and sum by 1 column, keep all other columns, and mutate a new column, counting summed rows with pandas

I am new to Python. I can see at least 5 similar questions, and this one is very close, but none of them works for me.
I have a dataframe with non-unique customers.
customer_id amount male age income days reward difficulty duration
0 id_1 16.06 1 45 62000.0 608 2.0 10.0 10.0
1 id_1 18.00 1 45 62000.0 608 2.0 10.0 10.0
I am trying to group by customer_id, sum the amount column, keep all other columns, PLUS add one column total that counts my transactions.
Desired output
customer_id amount male age income days reward difficulty duration total
0 id_1 34.06 1 45 62000.0 608 2.0 10.0 10.0 2
My best attempt so far does not preserve all the other columns:
df.groupby('customer_id')['amount'].agg(total_sum='sum', total='count')
You could do it this way: include all other columns in your groupby, then reset_index after aggregating:
df.groupby(df.columns.difference(['amount']).tolist())['amount']\
  .agg(total_sum='sum', total='count').reset_index()
Output:
age customer_id days difficulty duration income male reward total_sum total
0 45 id_1 608 10.0 10.0 62000.0 1 2.0 34.06 2
You could do:
grouper = df.groupby('customer_id')
# keep the first value of every column that is not the grouping key or the summed column
first_dict = {col: 'first' for col in df.columns.difference(['customer_id', 'amount'])}
o = grouper.agg({
    'amount': 'sum',  # the desired output sums the amounts
    **first_dict,
})
o['total'] = grouper.size().values
Based on @Scott Boston's answer, I found an answer myself too. I acknowledge that my solution is not elegant (maybe someone can help clean it up), but it gives me an expanded solution when I have non-unique rows (for instance, each customer_id has five different transactions).
starbucks_grouped = df.groupby('customer_id').agg({'amount': ['sum'], 'reward': ['sum'], 'difficulty': ['mean'],
                                                   'duration': ['mean'], 'male': ['mean'],
                                                   'income': ['mean'], 'days': ['mean'], 'age': ['mean'],
                                                   'customer_id': ['count']}).reset_index()
df_grouped = starbucks_grouped.droplevel(1, axis=1)
My output is
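For completeness, a minimal sketch using named aggregation (available since pandas 0.25; the column choices here simply mirror the example frame and are otherwise an assumption) produces flat column names directly, so the droplevel step is not needed:
out = (df.groupby('customer_id', as_index=False)
         .agg(amount=('amount', 'sum'),        # summed per customer
              male=('male', 'first'),          # descriptive columns: keep the first value
              age=('age', 'first'),
              income=('income', 'first'),
              days=('days', 'first'),
              reward=('reward', 'first'),
              difficulty=('difficulty', 'first'),
              duration=('duration', 'first'),
              total=('amount', 'size')))       # number of transactions per customer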

How to create a new column with .size() values of another column in pandas?

df2 = df_cleaned.groupby('company').size()
df2.columns = ['company', 'frequency']
#df2.sort_values('frequency') # error : No axis named frequency for object type <class 'type'>
df2
I have a dataframe "df_cleaned" with a 'company' column, and I'm trying to create a new dataframe "df2" with an extra 'frequency' column to check how many times each company has been mentioned. I am unable to create the new frequency column. It seems like I'm doing something wrong; please help me out.
Screenshot showing no frequency column
You didn't provide the data, so let's generate some:
import numpy as np
import pandas as pd

source = ['3Com', '3M', 'A-T-O', 'A.H. Robins']
cmp = [source[i] for i in np.random.randint(4, size=20)]
df = pd.DataFrame(cmp, columns=['company'])
Out[1]:
company
0 A.H. Robins
1 3M
2 A.H. Robins
3 A.H. Robins
4 3M
5 3M
6 3Com
7 A-T-O
8 3Com
9 A-T-O
10 3M
11 3M
12 A-T-O
13 3M
14 3M
15 A.H. Robins
16 A-T-O
17 A-T-O
18 A-T-O
19 3Com
df.groupby('company')[['company']].count().rename(columns = {'company':'frequency'})
Out[2]:
frequency
company
3Com 3
3M 7
A-T-O 6
A.H. Robins 4
Use:
df2 = df_cleaned.groupby('company').size().to_frame('frequency')
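A minimal alternative sketch for the same result uses value_counts, which also sorts companies by frequency in descending order by default:
df2 = (df_cleaned['company'].value_counts()
       .rename_axis('company')
       .reset_index(name='frequency'))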
