Extract dataframe from dataframe based on column value with extended bounds - python

I have the following dataframe:
NAME  SIGNAL
a     0
b     0
c     0
d     0
e     1
f     1
g     1
h     0
i     0
j     0
k     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
t     0
I need to write a function that will allow me to extract another dataframe, or just modify the existing one, based on a condition:
Get all columns (in my case NAME) where the SIGNAL column is 1 for the row, but also extract 2 extra rows from above and 2 extra rows from below.
In my example, the function should return me the following table:
NAME  SIGNAL
c     0
d     0
e     1
f     1
g     1
h     0
i     0
j     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
Thanks!
UPDATE:
This is the code I have so far:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['a', 0], ['b', 0], ['c', 1], ['d', 1], ['e', 0], ['f', 0], ['g', 0], ['h', 1], ['i', 0], ['j', 0], ['k', 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['NAME', 'SIGNAL'])
# print dataframe.
print(df)
print("----------------")
for index, row in df.iterrows():
    # check when the signal changes from 0 to 1
    if (df.iloc[index]['SIGNAL'] == 1) and (df.iloc[index - 1]['SIGNAL'] == 0):
        print(df.iloc[index]['NAME'])  # first line with signal 1 after it was 0
        # print the 2 lines above
        print(df.iloc[index - 1]['NAME'])
        print(df.iloc[index - 2]['NAME'])
My dataframe is like:
NAME SIGNAL
0 a 0
1 b 0
2 c 1
3 d 1
4 e 0
5 f 0
6 g 0
7 h 1
8 i 0
9 j 0
10 k 0
My code is returning:
c
b
a
h
g
f
The problem here is that I cannot return the values of "d" and "e" + "f", or "i" and "j", because I get the error "IndexError: single positional indexer is out-of-bounds" if I try the condition:
(df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index+1]['SIGNAL'] == 0)
Also the extended bounds will be variable: sometimes I will work with 2 extra rows from top and bottom, sometimes with more.
I am looking for a solution using dataframe functions, not iteration.
thanks!

This will return the desired data frame:
df[(df.shift(periods=-2, axis="rows").SIGNAL == 1)
   | (df.shift(periods=-1, axis="rows").SIGNAL == 1)
   | (df.SIGNAL == 1)
   | (df.shift(periods=1, axis="rows").SIGNAL == 1)
   | (df.shift(periods=2, axis="rows").SIGNAL == 1)]
Output:
NAME  SIGNAL
c     0
d     0
e     1
f     1
g     1
h     0
i     0
l     0
m     0
n     1
o     1
p     1
q     1
r     0
s     0
Add .NAME to the end to get your series of names
2 c
3 d
4 e
5 f
6 g
7 h
8 i
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
Name: NAME, dtype: object
Update: for an arbitrarily large span:
m = (df.shift(periods=-400, axis="rows").SIGNAL == 1)
for i in range(-399, 401):
    m = m | (df.shift(periods=i, axis="rows").SIGNAL == 1)
print(df[m])
Disclaimer:
This method may be inefficient for large spans
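As an alternative sketch (not part of the answer above, and using the smaller df from the question's update): a centered rolling maximum over SIGNAL builds the same keep-mask for any span n, without constructing one shifted comparison per offset.

```python
import pandas as pd

df = pd.DataFrame({'NAME': list('abcdefghijk'),
                   'SIGNAL': [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]})

n = 2  # number of extra rows to keep on each side
# A row survives if any SIGNAL within n rows of it (in either direction) is 1.
keep = (df['SIGNAL']
        .rolling(2 * n + 1, center=True, min_periods=1)
        .max()
        .astype(bool))
print(df[keep])
```

min_periods=1 keeps the edge rows valid, so no NaN handling is needed at the top or bottom of the frame.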

Related

assign a shorter column to a longer column, given a sliding winow

I have this dataframe df1 of 8 rows:
ID
A
B
C
D
E
F
G
H
And I have this array arr of size 4, [-1, 0, 1, 2], and m = 2. I want to assign the values of this array to df1, moving to the next value every m rows, so that I eventually have:
ID N
A -1
B -1
C 0
D 0
E 1
F 1
G 2
H 2
How to do that in Python?
df1=pd.DataFrame({'ID':['A','B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr=[-1,0,1,2]
m=2
If arr should be repeated again and again:
df1['N'] = (arr * m)[:len(df1)]
Result:
  ID  N
0  A -1
1  B  0
2  C  1
3  D  2
4  E -1
If each element should be repeated:
df1['N'] = [arr[i] for i in range(len(arr)) for j in range(m)][:len(df1)]
Result:
  ID  N
0  A -1
1  B -1
2  C  0
3  D  0
4  E  1
Without numpy:
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = sum([[x]*m for x in arr], [])
With numpy:
import numpy as np
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = np.repeat(arr, m)
Output:
ID N
0 A -1
1 B -1
2 C 0
3 D 0
4 E 1
5 F 1
6 G 2
7 H 2
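A further hedged variant (names follow the question): if df1 could ever be longer than len(arr) * m, the (arr * m)[:len(df1)] slice comes up short; itertools.cycle covers any frame length while still repeating each element m times.

```python
from itertools import cycle, islice
import pandas as pd

df1 = pd.DataFrame({'ID': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr = [-1, 0, 1, 2]
m = 2

# Repeat each element m times, then cycle the pattern to cover any frame length.
block = [x for x in arr for _ in range(m)]
df1['N'] = list(islice(cycle(block), len(df1)))
```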

Replace contents of cell with another cell if condition on a separate cell is met

I have the following data frame:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
We can see 4 columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D if a condition on C is met; for this example the condition is C == 1.
The intended output is:
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df.drop('D', axis = 1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is one:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
import numpy as np

test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: Just found out that a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other. If you negate the condition of a .where statement with ~, then you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
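One more sketch, not posted as an answer above: when there are several condition/replacement pairs, nesting np.where gets awkward, and np.select generalizes it. Here both branches are spelled out explicitly (same test_df as in the question):

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'A': [1, 2, 5, 4, 3, 1],
                        'B': ['yes', 'No', 'hello', 'yes', 'no', 'why'],
                        'C': [1, 0, 1, 1, 0, 0],
                        'D': ['y', 'n', 'y', 'y', 'n', 'n']})

# One entry per rule; extend both lists for more condition/replacement pairs.
conditions = [test_df['C'] == 1, test_df['C'] != 1]
choices = [test_df['D'], test_df['B']]
test_df['B'] = np.select(conditions, choices)
```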

Checking Value from Specific Column of dataframe and updating values from an array to Column 2

I have a dataframe with 2 columns, Column_A and Column_B, and an array of letters from A to P, which are as follows:
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': []
})
the array is as follows:
label = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P']
Expected output is
'A':[0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
'B':['A','A','A','A','A','E','E','E','E','E','I','I','I','I','I','M']
The value in Column_B changes as soon as the value in Column_A is 1, and the new value is taken from the given array label.
I have tried using this for loop
for row in df.index:
    try:
        if df.loc[row, 'Column_A'] == 1:
            df.at[row, 'Column_B'] = label[row + 4]
            print(label[row])
        else:
            df.Column_B.fillna('ffill')
    except IndexError:
        row = (row + 4) % 4
        df.at[row, 'Column_B'] = label[row]
I also want to loop back to the start of the label array when it reaches the last value.
A solution that should do the trick looks like this:
label = list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({
    'Column_A': [0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1],
    'Column_B': label
})
I'm not exactly sure what you intended with the fillna; I think you don't need it.
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row in df.index:
    if df.loc[row, 'Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
I also avoid the exception handling in this case, because the "index overflow" can be handled without exception handling.
Btw., if you have a large dataframe you can probably make the code faster by eliminating one lookup (but you'd need to verify that it really runs faster). That solution would look like this:
max_index = len(label)
df['Column_B'] = 'ffill'
lookup = 0
for row, record in df.iterrows():
    if record['Column_A'] == 1:
        lookup = lookup + 4 if lookup + 4 < max_index else lookup % 4
        df.at[row, 'Column_B'] = label[lookup]
        print(label[row])
Option 1
import numpy as np

cond1 = df.Column_A == 1
cond2 = df.index == 0
mappr = lambda x: label[x]
df.assign(Column_B=np.where(cond1 | cond2, df.index.map(mappr), np.nan)).ffill()
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Option 2
a = np.append(0, np.flatnonzero(df.Column_A))
b = df.Column_A.to_numpy().cumsum()
c = np.array(label)
df.assign(Column_B=c[a[b]])
Column_A Column_B
0 0 A
1 0 A
2 0 A
3 0 A
4 0 A
5 1 F
6 0 F
7 0 F
8 0 F
9 0 F
10 1 K
11 0 K
12 0 K
13 0 K
14 0 K
15 1 P
Using groupby with transform then map
df.reset_index().groupby(df.Column_A.eq(1).cumsum())['index'].transform('first').map(dict(enumerate(label)))
Out[139]:
0 A
1 A
2 A
3 A
4 A
5 F
6 F
7 F
8 F
9 F
10 K
11 K
12 K
13 K
14 K
15 P
Name: index, dtype: object
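As a hedged alternative to the groupby answer above (same df and label as in the question): mark the rows where a new label starts, map only those through label, and forward-fill the gaps.

```python
import pandas as pd

label = list('ABCDEFGHIJKLMNOP')
df = pd.DataFrame({'Column_A': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]})

# Rows where a new label should start: row 0 plus every row with a 1.
starts = df.index[(df['Column_A'] == 1) | (df.index == 0)]
df['Column_B'] = (pd.Series({i: label[i] for i in starts})
                  .reindex(df.index)
                  .ffill())
```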

Generate summary for each row based on column values

I have a data frame like this:
import pandas as pd

raw_data = {'Sub1': ['A', 'B', 'C', 'D', 'E'],
            'Sub2': ['F', 'G', 'H', 'I', 'J'],
            'Sub3': ['K', 'L', 'M', 'N', 'O'],
            'S_score1': [1, 0, 0, 6, 0],
            'S_score2': [0, 1, 0, 6, 0],
            'S_score3': [0, 1, 0, 6, 0],
            }
df2 = pd.DataFrame(raw_data, columns=['Sub1', 'Sub2', 'Sub3', 'S_score1', 'S_score2', 'S_score3'])
I want to check the score columns and, when a score is at least 1, take the respective subject as text.
First, separate out the grade columns from the one hot columns.
u = df2.filter(like='Sub')
v = df2.filter(like='S_score').astype(bool)
Next, aggregate letter grades by multiplication, and set column values.
r = (u.mul(v.values)
      .agg(','.join, axis=1)
      .str.strip(',')
      .str.replace(',{2,}', ',', regex=True))
import numpy as np

df2['s_text'] = np.where(r.str.len() > 0, 'You scored ' + r, 'N/A')
df2
Sub1 Sub2 Sub3 S_score1 S_score2 S_score3 s_text
0 A F K 1 0 0 You scored A
1 B G L 0 1 1 You scored G,L
2 C H M 0 0 0 N/A
3 D I N 6 6 6 You scored D,I,N
4 E J O 0 0 0 N/A
To make the last separator different, you will need a custom function.
def join(lst):
    lst = lst[lst != '']
    if len(lst) > 1:
        return 'You scored ' + ', '.join(lst[:-1]) + ' and ' + lst[-1]
    elif len(lst) > 0:
        return 'You scored ' + ', '.join(lst)
    return 'N/A'
df2['s_text'] = u.mul(v.values).agg(join, axis=1)
df2
Sub1 Sub2 Sub3 S_score1 S_score2 S_score3 s_text
0 A F K 1 0 0 You scored A
1 B G L 0 1 1 You scored G and L
2 C H M 0 0 0 N/A
3 D I N 6 6 6 You scored D, I and N
4 E J O 0 0 0 N/A
Doing this with join after multiplying:
s = (df2.filter(like='Sub') * df2.filter(like='S_').ge(1).values).apply(
        lambda x: ','.join([y for y in x if y != '']), axis=1)
s
Out[324]:
0 A
1 G,L
2
3 D,I,N
4
dtype: object
Then chain with np.where:
np.where(s=='','You do not have score','You have '+s)
Out[326]:
array(['You have A', 'You have G,L', 'You do not have score',
       'You have D,I,N', 'You do not have score'], dtype=object)
#Assign it back
df2['s_txt'] = np.where(s=='','You do not have score','You have '+s)
df2
Out[328]:
  Sub1 Sub2  ... S_score3                  s_txt
0    A    F  ...        0             You have A
1    B    G  ...        1           You have G,L
2    C    H  ...        0  You do not have score
3    D    I  ...        6         You have D,I,N
4    E    J  ...        0  You do not have score
[5 rows x 7 columns]
One of possible solutions consists of the following steps:
Define a function generating the output text for a source row.
This function should join source columns filtered for not null.
Generate subs table containing Sub1, Sub2 and Sub3.
Generate msk (mask) table containing S_score... columns and
change column names to Sub1, Sub2 and Sub3.
Compute subs.where(msk) and apply the above function to each row.
Note that for False elements in mask, the respective output element
is None, so the function applied will not include it in join.
So the whole script can look like below:
def txt(x):
    tbl = list(filter(lambda elem: not pd.isnull(elem), x))
    if len(tbl) > 0:
        return 'You have scored on ' + ', '.join(tbl)
    else:
        return 'You have not scored any subject'

subs = df2.loc[:, :'Sub3']
msk = df2.loc[:, 'S_score1':] > 0
msk.columns = ['Sub1', 'Sub2', 'Sub3']
df2['s_text'] = subs.where(msk).apply(txt, axis=1)
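As a final sketch (assuming the df2 from the question, not taken from any answer above): a plain list comprehension over the two column blocks is easy to read, at the cost of stepping outside vectorized pandas.

```python
import pandas as pd

raw_data = {'Sub1': ['A', 'B', 'C', 'D', 'E'],
            'Sub2': ['F', 'G', 'H', 'I', 'J'],
            'Sub3': ['K', 'L', 'M', 'N', 'O'],
            'S_score1': [1, 0, 0, 6, 0],
            'S_score2': [0, 1, 0, 6, 0],
            'S_score3': [0, 1, 0, 6, 0]}
df2 = pd.DataFrame(raw_data)

subs = df2.filter(like='Sub').to_numpy()
scores = df2.filter(like='S_score').to_numpy()

# One row at a time: keep the subjects whose score is at least 1.
df2['s_text'] = [
    'You scored ' + ','.join(n for n, sc in zip(names, vals) if sc >= 1)
    if (vals >= 1).any() else 'N/A'
    for names, vals in zip(subs, scores)
]
```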

keyerror after removing nans in pandas

I am reading a file with pd.read_csv and removing all the values that are -1. Here's the code
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C', 'D']
catalog = pd.read_csv('data.txt', sep='\s+', names=columns, skiprows=1)
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
print len(b) # answer is 700
# remove rows that are -1 in column b
idx = np.where(b != -1)[0]
a = a[idx]
b = b[idx]
c = c[idx]
d = d[idx]
print len(b) # answer is 612
So I am assuming that I have successfully managed to remove all the rows where the value in column b is -1.
In order to test this, I am doing the following naive way:
for i in range(len(b)):
    print i, a[i], b[i]
It prints out the values until it reaches a row which was supposedly filtered out. But now it gives a KeyError.
You can filter by boolean indexing:
catalog = catalog[catalog['B'] != -1]
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
The KeyError is expected, because the index values no longer match after the filtering.
One possible solution is to convert the Series to lists:
for i in range(len(b)):
    print i, list(a)[i], list(b)[i]
Sample:
catalog = pd.DataFrame({'A': list('abcdef'),
                        'B': [-1, 5, 4, 5, -1, 4],
                        'C': [7, 8, 9, 4, 2, 3],
                        'D': [1, 3, 5, 7, 1, 0]})
print (catalog)
   A  B  C  D
0  a -1  7  1
1  b  5  8  3
2  c  4  9  5
3  d  5  4  7
4  e -1  2  1
5  f  4  3  0
#filtered DataFrame has no index 0, 4
catalog = catalog[catalog['B'] != -1]
print (catalog)
A B C D
1 b 5 8 3
2 c 4 9 5
3 d 5 4 7
5 f 4 3 0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
print (b)
1 5
2 4
3 5
5 4
Name: B, dtype: int64
#a[i] in the first loop wants to match index value 0 (a[0]), which does not exist, so KeyError;
#same problem for b[0]
for i in range(len(b)):
    print (i, a[i], b[i])
KeyError: 0
#convert the Series to a list, so list(a)[0] returns the first value of the list - there is no Series index
for i in range(len(b)):
    print (i, list(a)[i], list(b)[i])
0 b 5
1 c 4
2 d 5
3 f 4
Another solution is to create a default index 0,1,... with reset_index and drop=True:
catalog = catalog[catalog['B'] != -1].reset_index(drop=True)
print (catalog)
A B C D
0 b 5 8 3
1 c 4 9 5
2 d 5 4 7
3 f 4 3 0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
#default index values match, so a[0] and b[0] work
for i in range(len(b)):
    print (i, a[i], b[i])
0 b 5
1 c 4
2 d 5
3 f 4
If you filter out indices, then
for i in range(len(b)):
    print i, a[i], b[i]
will attempt to access the erased indices. Instead, you can use the following:
for i, ae, be in zip(a.index, a.values, b.values):
    print(i, ae, be)
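A hedged follow-up sketch (using a small made-up frame in the same shape as the question's): itertuples gives each row's surviving index label directly, which avoids both the positional counter and the manual zip.

```python
import pandas as pd

catalog = pd.DataFrame({'A': list('abcdef'),
                        'B': [-1, 5, 4, 5, -1, 4]})
catalog = catalog[catalog['B'] != -1]

# row.Index carries the surviving label, so no positional counter is needed.
pairs = [(row.Index, row.A, row.B) for row in catalog.itertuples()]
print(pairs)
```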
