given this df:
df = pd.DataFrame({"A":[1,0,0,0,0,1,0,1,0,0,1],'B':['enters','A','B','C','D','exit','walk','enters','Q','Q','exit'],"Value":[4,4,4,4,5,6,6,6,6,6,6]})
A B Value
0 1 enters 4
1 0 A 4
2 0 B 4
3 0 C 4
4 0 D 5
5 1 exit 6
6 0 walk 6
7 1 enters 6
8 0 Q 6
9 0 Q 6
10 1 exit 6
There are 2 'transactions' here: someone enters and then leaves. So tx#1 spans rows 0 to 5 and tx#2 spans rows 7 to 10.
My goal is to show whether the value changed: in tx#1 the value changed from 4 to 6, and in tx#2 there was no change. Expected result:
index tx value_before value_after
0 1 4 6
7 2 6 6
I tried to fill the 0s between each tx with 1 and then group, but then the whole A column becomes 1. I'm not sure how to define the groupby if each tx stands on its own.
Assign a new transaction number on each new 'enters' and pivot (keyword arguments, since positional pivot arguments were removed in pandas 2.0):
df['tx'] = np.where(df.B.eq('enters'), 1, 0).cumsum()
df[df.B.isin(['enters', 'exit'])].pivot(index='tx', columns='B', values='Value')
Result:
B enters exit
tx
1 4 6
2 6 6
Not exactly what you want, but it has all the info:
df[df['B'].isin(['enters', 'exit'])].drop(['A'], axis=1).reset_index()
index B Value
0 0 enters 4
1 5 exit 6
2 7 enters 6
3 10 exit 6
You can get what you need with cumsum() and pivot_table():
df['tx'] = np.where(df['B'] == 'enters', 1, 0).cumsum()
res = (pd.pivot_table(df[df['B'].isin(['enters', 'exit'])],
                      index=['tx'], columns=['B'], values='Value')
       .reset_index()
       .rename(columns={'enters': 'value_before', 'exit': 'value_after'}))
Which prints:
res
tx value_before value_after
0 1 4 6
1 2 6 6
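If you also want the original row index of each 'enters' row, as in the expected output at the top, one way (a sketch along the same lines, not the only option) is to attach it after the pivot:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1],
                   "B": ['enters', 'A', 'B', 'C', 'D', 'exit',
                         'walk', 'enters', 'Q', 'Q', 'exit'],
                   "Value": [4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6]})

# Number each transaction by counting 'enters' rows.
df['tx'] = df['B'].eq('enters').cumsum()

res = (df[df['B'].isin(['enters', 'exit'])]
       .pivot(index='tx', columns='B', values='Value')
       .rename(columns={'enters': 'value_before', 'exit': 'value_after'})
       .reset_index())

# Re-attach the original row index of each 'enters' row.
res.index = df.index[df['B'].eq('enters')]
```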
If you always have an "enters - exit" sequence you can create a new dataframe and assign the values to each column (tx is just 1..n, where n is the number of 'enters' rows):
result = pd.DataFrame({'tx': range(1, df['B'].eq('enters').sum() + 1),
                       'value_before': df['Value'].loc[df['B'] == 'enters'],
                       'value_after': list(df['Value'].loc[df['B'] == 'exit'])})
Output:
tx value_before value_after
0 1 4 6
7 2 6 6
You can add .reset_index(drop=True) at the end if you don't want to keep the index from the original dataframe.
'value_after' is wrapped in list() so its values are assigned positionally instead of being aligned on the original (non-matching) index.
Hello coders,
I'm trying to assign a value to a column within the dataframe based on another variable.
My data looks like:
Housing_ID Member_ID My_new_staus
1 1
1 2
1 3
1 4
1 5
2 1
2 2
3 1
3 2
3 3
What I want: where Housing_ID equals 1 (which is repeated), put "Valid" in My_new_staus.
I tried to apply it through this code:
for i in range(len(df['Housing_ID'])):
    if df['Housing_ID'][i] == 1:
        df['My_new_staus'][i] = 'Valid'
    else:
        df['My_new_staus'][i] = ''
and got this message:
# Similar to Index.get_value, but we do not fall back to positional
KeyError: 398
The output that I want is:
Housing_ID Member_ID My_new_staus
1 1 Valid
1 2 Valid
1 3 Valid
1 4 Valid
1 5 Valid
2 1
2 2
3 1
3 2
3 3
You can use np.where to assign values to 'My_new_staus':
df['My_new_staus'] = np.where(df['Housing_ID'] == 1, 'Valid', '')
Output:
Housing_ID Member_ID My_new_staus
0 1 1 Valid
1 1 2 Valid
2 1 3 Valid
3 1 4 Valid
4 1 5 Valid
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
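If other Housing_ID values later need their own labels, np.select generalizes the same idea. A small sketch (the 'Pending' label and the sample data here are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Housing_ID': [1, 1, 2, 3],
                   'Member_ID': [1, 2, 1, 1]})

# One condition/label pair per Housing_ID value; everything else gets ''.
conditions = [df['Housing_ID'].eq(1), df['Housing_ID'].eq(2)]
choices = ['Valid', 'Pending']  # 'Pending' is a hypothetical label
df['My_new_staus'] = np.select(conditions, choices, default='')
```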
In general, you should avoid iterating through dataframes. You can just use an .apply():
df = df.assign(My_new_staus=df.Housing_ID.apply(lambda x: 'Valid' if x == 1 else ''))
I have a dataframe like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to remove duplicate rows, where the order of the source and target columns does not matter: two rows count as duplicates if they have the same weight and the same source/target values in either order, and only the first occurrence should be kept. In this case, the expected result would be
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there any way to do this without loops?
Use frozenset and duplicated
df[~df[['source', 'target']].apply(frozenset, axis=1).duplicated()]
source target weight
0 1 2 5
4 3 1 6
5 1 1 6
If you want to account for unordered source/target and weight (so both (1, 2, 5) and (1, 2, 7) survive):
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, axis=1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, to be explicit, here is a more readable version.
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Should be fairly easy.
data = [[1, 2, 5],
        [2, 1, 5],
        [1, 2, 5],
        [1, 2, 7],
        [3, 1, 6],
        [1, 1, 6],
        [1, 3, 6],
        ]
df = pd.DataFrame(data, columns=['source', 'target', 'weight'])
You can drop the exact duplicates using drop_duplicates (keep=False removes every copy of a fully duplicated row):
df = df.drop_duplicates(keep=False)
print(df)
would result in:
source target weight
1 2 1 5
3 1 2 7
4 3 1 6
5 1 1 6
6 1 3 6
Because you want to handle the unordered source/target issue, sort each pair first (applied to the original dataframe, before any dropping):
def pair(row):
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)
and then you can use df.drop_duplicates():
source target weight
0 1 2 5
3 1 2 7
4 1 3 6
5 1 1 6
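For larger frames, the pair-sorting above can also be done without a row-wise apply, using np.sort across the two columns. A sketch on the question's data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 1, 1, 3, 1, 1],
                   'target': [2, 1, 2, 2, 1, 1, 3],
                   'weight': [5, 5, 5, 7, 6, 6, 6]})

# Sort each (source, target) pair so (2, 1) and (1, 2) compare equal.
pairs = pd.DataFrame(np.sort(df[['source', 'target']].to_numpy(), axis=1),
                     index=df.index, columns=['source', 'target'])

# Drop rows whose (pair, weight) combination was already seen,
# keeping the original (unsorted) rows in the result.
out = df[~pairs.assign(weight=df['weight']).duplicated()]
```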
I am currently trying to add 1 to an entire column wherever the value (int) is greater than 0. The code that I am currently using for it is like so:
for coldcloudy in final.coldcloudy:
    final.loc[final['coldcloudy'] > 0, coldcloudy] += 1
However I keep on getting a 'KeyError: 0' with it. Essentially, I want the code to go row by row in a particular column and add 1 if the integer is greater than zero. The values that get 1 added will go into another column. Can someone please help?
You don't need a for loop:
final = pd.DataFrame({'coldcloudy':np.random.choice([0,1],20)})
final.loc[final.coldcloudy > 0, 'coldcloudy'] += 1
print(final)
Output (your values will vary, since the example data is random):
coldcloudy
0 2
1 2
2 0
3 0
4 2
5 2
6 0
7 2
8 0
9 0
10 2
11 2
12 0
13 2
14 2
15 0
16 2
17 0
18 2
19 2
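If, as mentioned in the question, the incremented values should land in a separate column instead of overwriting the original, .where() can do that in one step. A sketch (the column name 'bumped' and the sample data are made up):

```python
import pandas as pd

final = pd.DataFrame({'coldcloudy': [0, 3, 0, 1]})

# Keep values <= 0 as-is, add 1 where the value is positive,
# and write the result into a new column.
final['bumped'] = final['coldcloudy'].where(final['coldcloudy'] <= 0,
                                            final['coldcloudy'] + 1)
```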
What is the most effective way to solve the following problem with pandas?
Assume we have the following df:
v1 v2
index
0 1 2
1 5 6
2 7 3
3 9 4
4 5 1
Now we want to calculate a third value (v3) based on the following rule, where shift(1) refers to the previous row:
if df.v1.shift(1) > df.v3.shift(1):
    df.v3 = max(df.v2, df.v3.shift(1))
else:
    df.v3 = df.v2
The desired output should look like:
v1 v2 v3
index
0 1 2 2
1 5 6 6
2 7 3 3
3 9 4 4
4 5 1 4
THX & BR from Vienna
I believe the following two lines get to your result:
df['v3'] = df['v2']
df['v3'] = df['v3'].where(df['v1'].shift(1) <= df['v3'].shift(1),
                          pd.DataFrame([df['v2'], df['v3'].shift(1)]).max())
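Because v3 depends on the previous row's v3, it can be worth sanity-checking any vectorised attempt against a plain loop. A reference sketch of the rule on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'v1': [1, 5, 7, 9, 5], 'v2': [2, 6, 3, 4, 1]})

# Row-by-row translation of the rule: if the previous v1 exceeds the
# previous v3, carry max(v2, previous v3) forward, otherwise take v2.
v3 = []
for i in range(len(df)):
    if i > 0 and df['v1'].iloc[i - 1] > v3[i - 1]:
        v3.append(max(df['v2'].iloc[i], v3[i - 1]))
    else:
        v3.append(df['v2'].iloc[i])
df['v3'] = v3
```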
With the below example:
df = pd.DataFrame({'signal': [1,0,0,1,0,0,0,0,1,0,0,1,0,0],
                   'product': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],
                   'price': [1,2,3,4,5,6,7,1,2,3,4,5,6,7],
                   'price2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function fill_price that creates a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals price if 'signal' is 1, and equals the previous row's Price_B if 'signal' is 0. If the subgroup starts with a 0 'signal', then Price_B is kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subsets separately, and also apply it to both 'price' and 'price2'. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much more simply without the extra function.
df['Price_B'] = (df.groupby('product',as_index=False)
.apply(lambda x: x['price2'].where(x.signal==1).ffill().fillna(0))
.reset_index(level=0, drop=True))
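The same idea can also be written by masking first and then grouping the resulting Series, which avoids apply entirely. A sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'signal': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
                   'product': ['A'] * 7 + ['B'] * 7,
                   'price2': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]})

# Blank out prices where signal is 0, then forward-fill within each
# product group; leading gaps (a group starting with signal 0) become 0.
df['Price_B'] = (df['price2'].where(df['signal'].eq(1))
                 .groupby(df['product']).ffill()
                 .fillna(0).astype(int))
```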