Pandas: batch substitution of values from different rows meeting the same criteria - python

I have extracted some data into a pandas DataFrame from a SQL server. The structure looks like this:
df = pd.DataFrame({'Day':(1,2,3,4,1,2,3,4),'State':('A','A','A','A','B','B','B','B'),'Direction':('N','S','N','S','N','S','N','S'),'values':(12,34,22,37,14,16,23,43)})
>>> df
   Day Direction State  values
0    1         N     A      12
1    2         S     A      34
2    3         N     A      22
3    4         S     A      37
4    1         N     B      14
5    2         S     B      16
6    3         N     B      23
7    4         S     B      43
Now I want to replace each value where State == 'A' with itself plus the value that has the same Day and Direction but State == 'B'. For example, like this:
mask_a = (df.Day == 1) & (df.Direction == 'N') & (df.State == 'A')
mask_b = (df.Day == 1) & (df.Direction == 'N') & (df.State == 'B')
df.loc[mask_a, 'values'] = df.loc[mask_a, 'values'].values + df.loc[mask_b, 'values'].values
>>> df
   Day Direction State  values
0    1         N     A      26
1    2         S     A      34
2    3         N     A      22
3    4         S     A      37
4    1         N     B      14
5    2         S     B      16
6    3         N     B      23
7    4         S     B      43
Notice that the value in the first row has changed from 12 to 26 (12 + 14).
Since the values come from different rows, it seems difficult to use combine_first here.
Right now I loop over 'Day' and 'Direction' and apply the assignment above for each combination, which gets extremely slow as the dataframe grows. Is there a smart, efficient way to do this?

You can first define a function that adds the B value to the A values within each group, then apply this function to each group.
def f(x):
    x.loc[x.State == 'A', 'values'] += x.loc[x.State == 'B', 'values'].iloc[0]
    return x

df.groupby(['Day', 'Direction']).apply(f)
Out[94]:
   Day Direction State  values
0    1         N     A      26
1    2         S     A      50
2    3         N     A      45
3    4         S     A      80
4    1         N     B      14
5    2         S     B      16
6    3         N     B      23
7    4         S     B      43
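For larger frames you can also avoid groupby.apply entirely with a merge. This is my own sketch (not part of the answer above); it assumes each (Day, Direction) pair has exactly one 'A' row and one 'B' row, and that df keeps its default RangeIndex:
# pull out the B values, attach them to every row with the same Day and Direction,
# then add them only to the A rows
b = (df.loc[df.State == 'B', ['Day', 'Direction', 'values']]
       .rename(columns={'values': 'b_values'}))
merged = df.merge(b, on=['Day', 'Direction'], how='left')
df['values'] = merged['values'] + merged['b_values'].where(merged['State'] == 'A', 0)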

Related

Sample from dataframe with conditions

I have a large dataset and I want to sample from it, but with a condition. What I need is a new dataframe in which the counts of the two values of a boolean target column (0 and 1) are roughly the same.
What I have:
df['target'].value_counts()
0 = 4000
1 = 120000
What I need:
new_df['target'].value_counts()
0 = 4000
1 = 6000
I know I can use df.sample, but I don't know how to include the condition.
Thanks
Since pandas 1.1.0, you can use groupby.sample if you need the same number of rows for each group:
df.groupby('target').sample(4000)
Demo:
df = pd.DataFrame({'x': [0] * 10 + [1] * 25})
df.groupby('x').sample(5)
x
8 0
6 0
7 0
2 0
9 0
18 1
33 1
24 1
32 1
15 1
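Side note (not part of the original answer): groupby.sample also accepts random_state if you need reproducible draws, e.g. df.groupby('x').sample(5, random_state=0).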
If you need to sample conditionally based on the group value, you can do:
df.groupby('target', group_keys=False).apply(
    lambda g: g.sample(4000 if g.name == 0 else 6000)
)
Demo:
df.groupby('x', group_keys=False).apply(
    lambda g: g.sample(4 if g.name == 0 else 6)
)
x
7 0
8 0
2 0
1 0
18 1
12 1
17 1
22 1
30 1
28 1
Assuming the following input and using the values 4/6 instead of 4000/6000:
df = pd.DataFrame({'target': [0,1,1,1,0,1,1,1,0,1,1,1,0,1,1,1]})
You could groupby your target and sample to take at most N values per group:
df.groupby('target', group_keys=False).apply(lambda g: g.sample(min(len(g), 6)))
example output:
target
4 0
0 0
8 0
12 0
10 1
14 1
1 1
7 1
11 1
13 1
If you want the same size you can simply use df.groupby('target').sample(n=4)
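For reference, the same balance can also be reached without groupby by sampling each class separately and concatenating. This is a separate sketch (not from the answers above), assuming the df from the question with a 'target' column of 0s and 1s:
import pandas as pd

# keep every minority-class row, draw 6000 majority-class rows, then shuffle
minority = df[df['target'] == 0]
majority = df[df['target'] == 1].sample(n=6000, random_state=0)
new_df = pd.concat([minority, majority]).sample(frac=1, random_state=0)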

Is there a way to reference a previous value in Pandas column efficiently?

I want to do some complex calculations in pandas while referencing previous values (basically I'm calculating row by row). However, the loops take forever, and I wanted to know if there is a faster way. Everybody keeps mentioning shift, but I don't understand how that would even work.
df = pd.DataFrame(index=range(500))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1
for i in range(len(df)):
    if i != 0:
        df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25
numpy_ext can be used for expanding window calculations. See pandas-rolling-apply-using-multiple-columns for reference. I have also included a simpler calculation to demonstrate the behaviour.
import numpy as np
import pandas as pd
import numpy_ext as npe

df = pd.DataFrame(index=range(5000))
df["A"] = 2
df["B"] = 5
df["A"][0] = 1

# for i in range(len(df)):
#     if i != 0: df['A'][i] = (df['A'][i-1] / 3) - df['B'][i-1] + 25

# SO example - function of previous values in A and B
def f(A, B):
    r = np.sum(A[:-1] / 3) - np.sum(B[:-1] + 25) if len(A) > 1 else A[0]
    return r

# much simpler example, sum of previous values
def g(A):
    return np.sum(A[:-1])

df["AB_combo"] = npe.expanding_apply(f, 1, df["A"].values, df["B"].values)
df["A_running"] = npe.expanding_apply(g, 1, df["A"].values)
print(df.head(10).to_markdown())
sample output
|    |   A |   B |   AB_combo |   A_running |
|---:|----:|----:|-----------:|------------:|
|  0 |   1 |   5 |     1      |           0 |
|  1 |   2 |   5 |   -29.6667 |           1 |
|  2 |   2 |   5 |   -59      |           3 |
|  3 |   2 |   5 |   -88.3333 |           5 |
|  4 |   2 |   5 |  -117.667  |           7 |
|  5 |   2 |   5 |  -147      |           9 |
|  6 |   2 |   5 |  -176.333  |          11 |
|  7 |   2 |   5 |  -205.667  |          13 |
|  8 |   2 |   5 |  -235      |          15 |
|  9 |   2 |   5 |  -264.333  |          17 |
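If installing numpy_ext is not an option, the original recurrence itself can be sped up a great deal just by looping over plain NumPy arrays instead of indexing the DataFrame row by row, and writing the result back once. A minimal sketch of that idea (mine, not part of the answer above):
import pandas as pd

df = pd.DataFrame(index=range(500))
df["A"] = 2.0
df["B"] = 5.0
df.loc[0, "A"] = 1.0

a = df["A"].to_numpy(dtype=float)   # element access on plain arrays is cheap
b = df["B"].to_numpy(dtype=float)
for i in range(1, len(a)):
    a[i] = a[i - 1] / 3 - b[i - 1] + 25
df["A"] = a                         # write the result back in one go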

Unpacking error with multiple variables in for loop

I wish to extract a dataframe of floating-point numbers based on the first instance of the markers MrkA and Mrk1. I am not interested in the second instance of MrkA, because I know which columns to extract via the line that builds df1.
Input:
df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56], 'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],'I':['dfg',7,'sdfA',5,7,23]})
      A     B      C     D     E     F    G    H     I
0  sdfg  jfgh    cvb   rrb   dfg   dfg  dfg  dfg   dfg
1    23    23      7     7     7     7    7    7     7
2  MrkA  sdfg  dsfgA  MrkA   gfd  sdfA  sdA  gfA  sdfA
3    34  Mrk1   ghks     3  Mrk2     5    5    5     5
4     0     0     47     0    12     0    8    0     7
5    56    56      3     7     1     4    9    8    23
for i,j in range(df.shape[1]):
    for k,l in range(df.shape[0]):
        if df.iloc[k,i] == 'MrkA' and df.iloc[l,j] == 'Mrk1':
            col = i
            row = k
            df1 = df.iloc[row+2:, [col, col+1, col+2, col+4, col+5, col+7, col+8]]
            break
Output: cannot unpack non-iterable int object
Desired Output:
    A   B   C   E  F  H   I
4   0   0  47  12  0  0   7
5  56  56   3   1  4  8  23
How shall I proceed?
Your problem is that df.shape[0] / df.shape[1] is a single integer, so each element of range(value) is an int, and trying to unpack it into two indices causes the error.
It should be:
for i in range(df.shape[1]):
    for j in range(df.shape[0]):
Then you can apply the desired logic to extract the rows.
Note that it's unclear why you ignore the second row, which is also all numeric. If that's only a typo, you can try the following to extract all the fully numeric rows and apply some logic there:
df[df.applymap(np.isreal).all(1)]
Edit
Although it is not clear from your specific example what the logic is:
- In the example you gave there is no Mrk1, only MrkB.
- Why has column D disappeared?
A hard-coded example that gives the desired output should be something similar to the following:
import pandas as pd
df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56], 'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],'I':['dfg',7,'sdfA',5,7,23]})
for r in range(0, df.shape[0] - 1):
    for c in range(df.shape[1] - 1):
        if df.iloc[r, c] == "MrkA" and df.iloc[r + 1, c + 1] == "MrkB":
            print(df.iloc[r + 2:, :])
This gives:
    A   B   C  D   E  F  G  H   I
4   0   0  47  0  12  0  8  0   7
5  56  56   3  7   1  4  9  8  23
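If the goal is also to drop the unwanted columns (D and G in the desired output), one option is to combine the fixed loops with the hard-coded column offsets from the question. This is only a sketch built from those offsets, not taken from the answers above:
import pandas as pd

df = pd.DataFrame({'A':['sdfg',23,'MrkA',34,0,56],'B':['jfgh',23,'sdfg','MrkB',0,56],
                   'C':['cvb',7,'dsfgA','ghks',47,3],'D':['rrb',7,'gfd',3,0,7],
                   'E':['dfg',7,'gfd',5,12,1],'F':['dfg',7,'sdfA',5,0,4],
                   'G':['dfg',7,'sdA',5,8,9],'H':['dfg',7,'gfA',5,0,8],
                   'I':['dfg',7,'sdfA',5,7,23]})

df1 = None
for r in range(df.shape[0] - 1):
    for c in range(df.shape[1] - 1):
        if df.iloc[r, c] == 'MrkA' and df.iloc[r + 1, c + 1] == 'MrkB':
            # column offsets taken verbatim from the question's df1 line
            cols = [c, c + 1, c + 2, c + 4, c + 5, c + 7, c + 8]
            df1 = df.iloc[r + 2:, cols]
            break
    if df1 is not None:
        break
print(df1)   # columns A, B, C, E, F, H, I; rows 4 and 5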

Drop rows if value in column changes

Assume I have the following pandas data frame:
    my_class  value
0          1      1
1          1      2
2          1      3
3          2      4
4          2      5
5          2      6
6          2      7
7          2      8
8          2      9
9          3     10
10         3     11
11         3     12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
    my_class  value
0          1      1
5          2      6
6          2      7
11         3     12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
    my_class  value
0          1      1
5          2      6
6          2      7
11         3     12
One possible solution is to:
- Make use of the fact that the index contains consecutive integers.
- Find the index values where the class changes.
- For each such index i, generate the sequence of indices from i-2 to i+1, and concatenate these sequences.
- Retrieve the rows whose indices are not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0, downcast='infer') == 1].index
df[~df.index.isin([item for sublist in [range(i - 2, i + 2) for i in ind]
                   for item in sublist])]
import numpy as np
import pandas as pd

my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)

idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)

print(df.drop(idx2drop)[['my_class', 'value']])
Output:
    my_class  value
0          1      1
5          2      6
6          2      7
11         3     12
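For completeness, here is a small sketch that parameterizes n (my own, not from the answers above). It assumes the default integer index 0..len(df)-1 and drops the n rows on each side of every class boundary:
n = 2
# first row of each new class (skip the very first row of the frame)
change_idx = df.index[df['my_class'] != df['my_class'].shift()][1:]

drop = set()
for i in change_idx:
    # the last n rows of the old class and the first n rows of the new one
    drop.update(range(i - n, i + n))

result = df[~df.index.isin(drop)]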

Groupby on condition and calculate sum of subgroups

Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
   a  b   c
0  1  3  10
1  1  4  11
2  1  5  12
3  2  6  13
4  2  7  14
5  3  8  15
6  3  9  16
Question:
How can I do a calculation on different elements of each subgroup? For example, for each group I want to extract every element in column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))

list = []
def f(x):
    list_new = []
    for row in range(0, len(x)):
        if (x.iloc[row, 0] > 4 and x.iloc[row, 0] < 9):
            list_new.append(x.iloc[row, 1])
    list.append(sum(list_new))

results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations and filter against your criteria first, since the condition on 'b' does not depend on the grouping.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also groupby first.
z = z.groupby('a').apply(
    lambda x: x.loc[x['b'].between(4, 9, inclusive=False), 'c'].sum()
).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x: sum(x.loc[(x['b'] > 4) & (x['b'] < 9), 'c'])) \
 .reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15
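A side note on newer pandas versions (not from the answers above): since pandas 1.3 the inclusive argument of Series.between takes a string ('both', 'neither', 'left', 'right') rather than a boolean, so the between-based solutions become:
z[z.b.between(4, 9, inclusive='neither')].groupby('a', as_index=False).c.sum()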
