What is the most effective way to solve the following problem with pandas?
Let's assume we have the following df:
       v1  v2
index
0       1   2
1       5   6
2       7   3
3       9   4
4       5   1
Now we want to calculate a third column (v3) based on the following rule (pseudocode):
if df.v1.shift(1) > df.v3.shift(1):
    df.v3 = max(df.v2, df.v3.shift(1))
else:
    df.v3 = df.v2
The desired output should look like:
       v1  v2  v3
index
0       1   2   2
1       5   6   6
2       7   3   3
3       9   4   4
4       5   1   4
THX & BR from Vienna
I believe the following two lines get to your result:
df['v3'] = df['v2']
df['v3'] = df['v3'].where(df['v1'].shift(1) <= df['v3'].shift(1),
                          pd.DataFrame([df['v2'], df['v3'].shift(1)]).max())
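Since the rule is recursive (v3 at row t refers back to v3 at row t-1), an explicit loop is the most literal translation of it. A minimal loop sketch, assuming the df from the question:
import pandas as pd

df = pd.DataFrame({'v1': [1, 5, 7, 9, 5], 'v2': [2, 6, 3, 4, 1]})

# row-by-row translation of the rule: v3 at row t depends on v1 and v3 at row t-1
v3 = [df['v2'].iloc[0]]                      # first row: nothing to compare against, so v3 = v2
for t in range(1, len(df)):
    if df['v1'].iloc[t - 1] > v3[t - 1]:
        v3.append(max(df['v2'].iloc[t], v3[t - 1]))
    else:
        v3.append(df['v2'].iloc[t])
df['v3'] = v3
For this sample it reproduces the desired v3 column (2, 6, 3, 4, 4).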
I am trying to handle the following dataframe
df = pd.DataFrame({'ID':[1,1,2,2,3,3,3,4,4,4,4],
'sum':[1,2,1,2,1,2,3,1,2,3,4,]})
Now I want to find the difference from the last row by each ID.
Specifically, I tried this code.
df['diff'] = df.groupby('ID')['sum'].diff(-1)
df
However, this only gives the difference from the adjacent (next) row, not from the last row of each group.
Is there any way to compute the difference from each group's last row with groupby?
Thank you for your help.
You can use transform('last') to get the last value per group:
df['diff'] = df['sum'].sub(df.groupby('ID')['sum'].transform('last'))
or using groupby.apply:
df['diff'] = df.groupby('ID')['sum'].apply(lambda x: x-x.iloc[-1])
output:
ID sum diff
0 1 1 -1
1 1 2 0
2 2 1 -1
3 2 2 0
4 3 1 -2
5 3 2 -1
6 3 3 0
7 4 1 -3
8 4 2 -2
9 4 3 -1
10 4 4 0
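Of the two, transform('last') is generally the better choice: it stays vectorized and broadcasts each group's last value back onto the original rows, whereas the apply version calls a Python lambda once per group.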
I have the following DataFrame dt:
a
0 1
1 2
2 3
3 4
4 5
How do I create a new column where each row is a function of previous rows?
For instance, say the formula is:
B_row(t) = A_row(t-1)+A_row(t-2)+3
Such that:
a b
0 1 /
1 2 /
2 3 6
3 4 8
4 5 10
Also, I hear a lot that we mustn't loop through rows in pandas; however, it seems to me that I should tackle this by looping through each row and building the value recursively, as I would in regular Python.
You could use shift, since b only depends on earlier values of a (not on earlier values of b):
dt['b'] = dt['a'].shift(1) + dt['a'].shift(2) + 3
Output:
a b
0 1 NaN
1 2 NaN
2 3 6.0
3 4 8.0
4 5 10.0
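If the formula instead referred to previous values of b itself (a true recurrence rather than a fixed window over a), shift alone would not be enough; the usual fallback is an explicit Python loop over the rows, along the lines of the loop sketch shown for the first question above.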
I have a csv file like this:
,,22-5-2021 (v_c) , 23-5-2021 (v_c)
col_a,col_b,v_c,v_d,v_c,v_d
1,1,2,4,5,6
2,2,2,3,7,6
3,3,2,5,6,5
I need to convert it to:
col_a,col_b,v_c,v_d,dates
1,1,2,4,22-5-2021
1,1,5,6,23-5-2021
2,2,2,3,22-5-2021
2,2,7,6,23-5-2021
3,3,2,5,22-5-2021
3,3,6,5,23-5-2021
or
col_a,col_b,v_c,v_d,dates
1,1,2,4,22-5-2021
2,2,2,3,22-5-2021
3,3,2,5,22-5-2021
1,1,5,6,23-5-2021
2,2,7,6,23-5-2021
3,3,6,5,23-5-2021
My approach was to use df.melt, but I couldn't quite get it to work. Maybe I'm lost on how to bring in the dates, since each date covers two columns.
You can try it via a list comprehension plus pd.wide_to_long():
# skip the date row; the header row gives col_a, col_b, v_c, v_d, v_c.1, v_d.1
df = pd.read_csv('etc.csv', header=1)
# add a '.0' suffix to the first block so every column ends in a numeric suffix
df.columns = [x if x.split('.')[-1].isnumeric() else x + '.0' for x in df]
# reshape with wide_to_long; the numeric suffix lands in the 'drop' column
df = (pd.wide_to_long(df, ['v_c', 'v_d'], ['col_a.0', 'col_b.0'], 'drop', sep='.')
        .reset_index().sort_values('drop'))
# map suffix -> date, then strip the '.0' suffixes from the column names
df['dates'] = df.pop('drop').map({0: '22-5-2021', 1: '23-5-2021'})
df.columns = df.columns.str.rstrip('.0')
output of df:
col_a col_b v_c v_d dates
0 1 1 2 4 22-5-2021
2 2 2 2 3 22-5-2021
4 3 3 2 5 22-5-2021
1 1 1 5 6 23-5-2021
3 2 2 7 6 23-5-2021
5 3 3 6 5 23-5-2021
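For comparison, a more explicit (if less general) sketch that slices the two date blocks and concatenates them; it assumes the same 'etc.csv' file, the header=1 trick from above, and hard-coded dates:
import pandas as pd

# header=1 skips the date row; pandas renames the duplicate columns to v_c, v_d, v_c.1, v_d.1
df = pd.read_csv('etc.csv', header=1)

blocks = []
for date, (c, d) in [('22-5-2021', ('v_c', 'v_d')), ('23-5-2021', ('v_c.1', 'v_d.1'))]:
    part = df[['col_a', 'col_b', c, d]].copy()
    part.columns = ['col_a', 'col_b', 'v_c', 'v_d']
    part['dates'] = date
    blocks.append(part)

out = pd.concat(blocks, ignore_index=True)   # rows grouped by date, the second desired layout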
Given this df:
df = pd.DataFrame({'A': [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
                   'B': ['enters', 'A', 'B', 'C', 'D', 'exit', 'walk', 'enters', 'Q', 'Q', 'exit'],
                   'Value': [4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6]})
A B Value
0 1 enters 4
1 0 A 4
2 0 B 4
3 0 C 4
4 0 D 5
5 1 exit 6
6 0 walk 6
7 1 enters 6
8 0 Q 6
9 0 Q 6
10 1 exit 6
There are two 'transactions' here: when someone enters and leaves. So tx #1 spans rows 0 to 5 and tx #2 spans rows 7 to 10.
My goal is to show whether the value changed. In tx #1 the value changed from 4 to 6, and in tx #2 there was no change. Expected result:
index tx value_before value_after
0 1 4 6
7 2 6 6
I tried to fill the 0s between each tx with 1 and then group, but then the whole A column becomes 1. I'm not sure how to define the groupby when each tx stands on its own.
Assign a new transaction number on each new 'enters' and pivot:
df['tx'] = np.where(df.B.eq('enters'), 1, 0).cumsum()
df[df.B.isin(['enters', 'exit'])].pivot(index='tx', columns='B', values='Value')
Result:
B enters exit
tx
1 4 6
2 6 6
Not exactly what you want, but it has all the info:
df[df['B'].isin(['enters', 'exit'])].drop(['A'], axis=1).reset_index()
index B Value
0 0 enters 4
1 5 exit 6
2 7 enters 6
3 10 exit 6
You can get what you need with cumsum() and pivot_table():
df['tx'] = np.where(df['B'] == 'enters', 1, 0).cumsum()
res = (pd.pivot_table(df[df['B'].isin(['enters', 'exit'])],
                      index=['tx'], columns=['B'], values='Value')
         .reset_index()
         .rename(columns={'enters': 'value_before', 'exit': 'value_after'}))
Which prints:
res
tx value_before value_after
0 1 4 6
1 2 6 6
If you always have a sequence "enters - exit" you can create a new dataframe and assign certain values to each column:
result = pd.DataFrame({'tx': [x + 1 for x in range((df['B'] == 'enters').sum())],
                       'value_before': df['Value'].loc[df['B'] == 'enters'],
                       'value_after': list(df['Value'].loc[df['B'] == 'exit'])})
Output:
tx value_before value_after
0 1 4 6
7 2 6 6
You can add 'reset_index(drop=True)' at the end if you don't want to see an index from the original dataframe.
I added list() around 'value_after' so its values align by position rather than by the original index (otherwise they would misalign and produce NaNs).
I have a dataframe like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to remove the duplicate rows, where the order of the source and target columns is not important: rows with the same source/target pair in either order count as duplicates and should be removed. In this case, the expected result would be:
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there any way to do this without loops?
Use frozenset and duplicated:
df[~df[['source', 'target']].apply(frozenset, axis=1).duplicated()]
source target weight
0 1 2 5
4 3 1 6
5 1 1 6
If you want to account for the unordered source/target pair together with weight:
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, axis=1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, to be explicit, with more readable code:
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Should be fairly easy.
data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6]]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])
You can drop the exact duplicate rows using drop_duplicates:
df = df.drop_duplicates()
print(df)
would result in:
source target weight
0 1 2 5
1 2 1 5
3 1 2 7
4 3 1 6
5 1 1 6
6 1 3 6
Because you also want to handle the unordered source/target issue, sort each pair first:
def pair(row):
    # put the smaller of (source, target) first so equivalent pairs compare equal
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)
and then you can use df.drop_duplicates():
source target weight
0 1 2 5
3 1 2 7
4 1 3 6
5 1 1 6
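As a side note (not from the answers above), a vectorized alternative to the row-wise pair function is to sort the two columns with NumPy before dropping duplicates; a minimal sketch on the same data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 1, 1, 3, 1, 1],
                   'target': [2, 1, 2, 2, 1, 1, 3],
                   'weight': [5, 5, 5, 7, 6, 6, 6]})

# sort each (source, target) pair row-wise so equivalent pairs become identical rows,
# then drop exact duplicates across all three columns
df[['source', 'target']] = np.sort(df[['source', 'target']].to_numpy(), axis=1)
print(df.drop_duplicates())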