keyerror after removing nans in pandas - python

I am reading a file with pd.read_csv and removing all the values that are -1. Here's the code
import pandas as pd
import numpy as np
columns = ['A', 'B', 'C', 'D']
catalog = pd.read_csv('data.txt', sep='\s+', names=columns, skiprows=1)
a = cataog['A']
b = cataog['B']
c = cataog['C']
d = cataog['D']
print len(b) # answer is 700
# remove rows that are -1 in column b
idx = np.where(b != -1)[0]
a = a[idx]
b = b[idx]
c = c[idx]
d = d[idx]
print len(b) # answer is 612
So I am assuming that I have successfully managed to remove all the rows where the value in column b is -1.
In order to test this, I am doing the following naive way:
for i in range(len(b)):
print i, a[i], b[i]
It prints out the values until it reaches a row which was supposedly filtered out. But now it gives a KeyError.

You can filtering by boolean indexing:
catalog = catalog[catalog['B'] != -1]
a = cataog['A']
b = cataog['B']
c = cataog['C']
d = cataog['D']
It is expected you get KeyError, because index values not match, because filtering.
One possible solution is convert Series to lists:
for i in range(len(b)):
print i, list(a)[i], list(b)[i]
Sample:
catalog = pd.DataFrame({'A':list('abcdef'),
'B':[-1,5,4,5,-1,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0]})
print (catalog)
A B C D
0 a -1 7 1
1 b 5 8 3
2 c 4 9 5
3 d 5 4 7
4 e -1 2 1
#filtered DataFrame have no index 0, 4
catalog = catalog[catalog['B'] != -1]
print (catalog)
A B C D
1 b 5 8 3
2 c 4 9 5
3 d 5 4 7
5 f 4 3 0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
print (b)
1 5
2 4
3 5
5 4
Name: B, dtype: int64
#a[i] in first loop want match index value 0 (a[0]) what does not exist, so KeyError,
#same problem for b[0]
for i in range(len(b)):
print (i, a[i], b[i])
KeyError: 0
#convert Series to list, so list(a)[0] return first value of list - there is no Series index
for i in range(len(b)):
print (i, list(a)[i], list(b)[i])
0 b 5
1 c 4
2 d 5
3 f 4
Another solution should be create default index 0,1,... by reset_index with drop=True:
catalog = catalog[catalog['B'] != -1].reset_index(drop=True)
print (catalog)
A B C D
0 b 5 8 3
1 c 4 9 5
2 d 5 4 7
3 f 4 3 0
a = catalog['A']
b = catalog['B']
c = catalog['C']
d = catalog['D']
#default index values match a[0] and a[b]
for i in range(len(b)):
print (i, a[i], b[i])
0 b 5
1 c 4
2 d 5
3 f 4

If you filter out indices, then
for i in range(len(b)):
print i, a[i], b[i]
will attempt to access erased indices. Instead, you can use the following:
for i, ae, be in zip(a.index, a.values, b.values):
print(i, ae, be)

Related

assign a shorter column to a longer column, given a sliding winow

I have this dataframe df1 of 8 rows:
ID
A
B
C
D
E
F
G
H
And I have this array arr of size 4 [-1, 0, 1, 2], and an m = 2, so I want to assign the values of this array jumping m times to df1, so I can have eventually:
ID N
A -1
B -1
C 0
D 0
E 1
F 1
G 2
H 2
How to do that in Python?
df1=pd.DataFrame({'ID':['A','B', 'C', 'D', 'E', 'F', 'G', 'H']})
arr=[-1,0,1,2]
m=2
If arr should be repeated again and again:
df1['N']=(arr*m)[:len(df1)]
Result:
ID
N
0
A
-1
1
B
0
2
C
1
3
D
2
4
E
-1
If each element should be repeated:
df1['N']=[arr[i] for i in range(len(arr)) for j in range(m)][:len(df1)]
Result:
ID
N
0
A
-1
1
B
-1
2
C
0
3
D
0
4
E
1
~~ Without numpy
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = sum([[x]*m for x in arr], [])
~~ With Numpy
import numpy as np
arr = [-1, 0, 1, 2]
m = 2
df1["N"] = np.repeat(arr, m)
Output:
ID N
0 A -1
1 B -1
2 C 0
3 D 0
4 E 1
5 F 1
6 G 2
7 H 2

Extract dataframe from dataframe based on column value with extended bounds

I have the following dataframe:
NAME
SIGNAL
a
0
b
0
c
0
d
0
e
1
f
1
g
1
h
0
i
0
j
0
k
0
l
0
m
0
n
1
o
1
p
1
q
1
r
0
s
0
t
0
I need to write a function that will allow me to extract another dataframe, or just modify the existing frame based on a condition:
Get all columns (in my case NAME) if SIGNAL column is 1 for the row but also extract 2 rows extra from above and 2 rows extra from bellow.
In my example, the function should return me the following table:
NAME
SIGNAL
c
0
d
0
e
1
f
1
g
1
h
0
i
0
j
0
l
0
m
0
n
1
o
1
p
1
q
1
r
0
s
0
Thanks!
UPDATE:
This is the code I have so far:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['a', 0], ['b', 0], ['c', 1], ['d', 1], ['e', 0], ['f', 0], ['g', 0], ['h', 1], ['i', 0], ['j', 0], ['k', 0]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['NAME', 'SIGNAL'])
# print dataframe.
print(df)
print("----------------")
for index, row in df.iterrows():
#print(row['Name'], row['Age'])
if((df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index-1]['SIGNAL'] == 0)): #check when the signal change from 0 to 1
print(df.iloc[index]['NAME']) #first line with signal 1 after it was 0
#print the above 2 lines
print(df.iloc[index-1]['NAME'])
print(df.iloc[index-2]['NAME'])
My dataframe is like:
NAME SIGNAL
0 a 0
1 b 0
2 c 1
3 d 1
4 e 0
5 f 0
6 g 0
7 h 1
8 i 0
9 j 0
10 k 0
My code is returning:
c
b
a
h
g
f
The problem here is that I cannot return the value of "d" and "e" + "f" or "i" and "j" because i get the error "IndexError: single positional indexer is out-of-bounds" if i try if condition:
(df.iloc[index]['SIGNAL'] == 1) & (df.iloc[index+1]['SIGNAL'] == 0)
enter code here
Also the extended bounds will be variable, sometimes I will work with 2 extra rows from top and bottom sometimes with more.
I am looking for a solution using dataframes functions and not iteration.
thanks!
This will return the desired data frame:
df[(df.shift(periods=-2, axis="rows").SIGNAL == 1) | (df.shift(periods=-1, axis="rows").SIGNAL == 1) | (df.SIGNAL == 1) | (df.shift(periods=1, axis="rows").SIGNAL == 1) | (df.shift(periods=2, axis="rows").SIGNAL == 1)]
Output:
NAME
SIGNAL
c
0
d
0
e
1
f
1
g
1
h
0
i
0
l
0
m
0
n
1
o
1
p
1
q
1
r
0
s
0
Add .NAME to the end to get your series of names
2 c
3 d
4 e
5 f
6 g
7 h
8 i
11 l
12 m
13 n
14 o
15 p
16 q
17 r
18 s
Name: NAME, dtype: object
Update: for arbitrarily large span
m=(df.shift(periods=-400, axis="rows").SIGNAL == 1)
for i in list(range(-399,401)):
m= m | (df.shift(periods=i, axis="rows").SIGNAL == 1)
print(df[m])
Disclaimer:
This method may be inefficient for large spans

Is there a good way to apply a function cumulatively to a pandas series of strings?

I have a Pandas data frame like this
x y
0 0 a
1 0 b
2 0 c
3 0 d
4 1 e
5 1 f
6 1 g
7 1 h
what I want to do is for each value of x to create a series which cumulatively concatenates the strings which have already appeared in y for that value of x. In other words, I want to get a Pandas series like this.
0
1 a,
2 a,b,
3 a,b,c,
4
5 e,
6 e,f,
7 e,f,g,
I can do it using a double for loop:
dat = pd.DataFrame({'x': [0, 0, 0, 0, 1, 1, 1, 1],
'y': ['a','b','c','d','e','f','g','h']})
z = dat['x'].copy()
for i in range(dat.shape[0]):
z[i] = ''
for j in range(i):
if dat['x'][j] == dat['x'][i]:
z[i] += dat['y'][j] + ","
but I was wondering whether there is a quicker way? It seems that pandas expanding().apply() doesn't work for strings and it is an open issue. But perhaps there is an efficient way of doing it which doesn't involve apply?
You can do with shift and np.cumsum in a custom function:
def myfun(x):
y = x.shift()
return np.cumsum(y.fillna('').add(',').mask(y.isna(),'')).str[:-1]
df.groupby("x")['y'].apply(myfun)
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object
We can group the dataframe by x then for each group in x we can cumsum and shift the column y and update the values in new column cum_y in dat
dat['cum_y'] = ''
for _, g in dat.groupby('x'):
dat['cum_y'].update(g['y'].add(',').cumsum().shift().str[:-1])
>>> dat
x y cum_y
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
Use GroupBy.transform with lambda function with Series.shift, adding ,, cumulative sum and last remove trailing separator:
f = lambda x: (x.shift(fill_value='') + ',').cumsum()
dat['z'] = dat.groupby('x')['y'].transform(f).str.strip(',')
print (dat)
x y z
0 0 a
1 0 b a
2 0 c a,b
3 0 d a,b,c
4 1 e
5 1 f e
6 1 g e,f
7 1 h e,f,g
I would try to use lists here. Unsure for the efficiency anyway...
df.assign(y=df['y'].apply(lambda x: [x])).groupby('x')['y'].transform(
lambda x: x.cumsum()).str.join(',')
It gives as expected:
0 a
1 a,b
2 a,b,c
3 a,b,c,d
4 e
5 e,f
6 e,f,g
7 e,f,g,h
Name: y, dtype: object
Can also do:
(df['y'].apply(list)
.groupby(df['x'])
.transform(lambda x: x.cumsum().shift(fill_value=''))
.str.join(',')
)
Output:
0
1 a
2 a,b
3 a,b,c
4
5 e
6 e,f
7 e,f,g
Name: y, dtype: object

Replace contents of cell with another cell if condition on a separate cell is met

I have to following data frame
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
we can see 4 columns A,B,C,D the intended outcome is to replace the contents of B with the contents of D, if a condition on C is met, for this example the condition is of C = 1
the intended output is
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df.drop('D', axis = 1)
What is the best way to apply this logic to a data frame?
There are many ways to solve, here is another one:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: Just found out a similar answer is posted by #QuangHoang. This answer is slightly different in that it does not require numpy
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other. If you pass a ~ to the condition of a .where statement, then you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n

Change the values of column after having used groupby on another column (pandas dataframe)

I have two data frames, one with the coordinates of places
coord = pd.DataFrame()
coord['Index'] = ['A','B','C']
coord['x'] = np.random.random(coord.shape[0])
coord['y'] = np.random.random(coord.shape[0])
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.30138
and one with several values measured in the places
df = pd.DataFrame()
df['Index'] = ['A','A','B','B','B','C','C','C','C']
df['Value'] = np.random.random(df.shape[0])
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
I want to find an efficient way of assigning the coordinates to the df data frame. For the moment I have tried
df['x'] = np.zeros(df.shape[0])
df['y'] = np.zeros(df.shape[0])
for i in df.Index.unique():
df.loc[df.Index == i, 'x'] = coord.loc[coord.Index == i,'x'].values
df.loc[df.Index == i, 'y'] = coord.loc[coord.Index == i,'y'].values
which works and yields
Index Value x y
0 A 0.220323 0.983739 0.121289
1 A 0.115075 0.983739 0.121289
2 B 0.432688 0.809586 0.639811
3 B 0.106178 0.809586 0.639811
4 B 0.259465 0.809586 0.639811
5 C 0.804018 0.827192 0.156095
6 C 0.552053 0.827192 0.156095
7 C 0.412345 0.827192 0.156095
8 C 0.235106 0.827192 0.156095
but this is quite sloppy, and highly inefficient. I tried to use the groupby operation like this
df['x'] =np.zeros(df.shape[0])
df['y'] =np.zeros(df.shape[0])
gb = df.groupby('Index')
for k in gb.groups.keys():
gb.get_group(k)['x'] = coord.loc[coord.Index == i ,'x']
gb.get_group(k)['y'] = coord.loc[coord.Index == i ,'y']
but I get this error here
/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I understand the problem, but I dont know how to overcome it.
Any suggestions ?
merge is what you're looking for.
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.301380
df.merge(coord, on='Index')
Index Value x y
0 A 0.930298 0.888025 0.376416
1 A 0.144550 0.888025 0.376416
2 B 0.393952 0.052976 0.396243
3 B 0.680941 0.052976 0.396243
4 B 0.657807 0.052976 0.396243
5 C 0.704954 0.564862 0.301380
6 C 0.733328 0.564862 0.301380
7 C 0.099785 0.564862 0.301380
8 C 0.871678 0.564862 0.301380

Categories

Resources