pd.DataFrame.update puts the result at the top of the dataframe - python

Let's say I have two dataframes like this:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
I want to update the zeroes in the z column of the nf dataframe with the values from the z column of the mf dataframe, but only in the rows whose keys in column x match.
When I call
nf.update(mf)
I get
x y z
b 1 10
d 2 100
e 3 1000
d 4 0
e 5 0
instead of the desired output
x y z
a 1 0
b 2 10
c 3 0
d 4 100
e 5 1000

To solve your problem, you need to match the indexes of both dataframes. Here is how you can do it:
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n).set_index('x')
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m).set_index('x')
nf.update(mf)
nf = nf.reset_index()
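For reference, an alternative sketch that avoids touching the index at all: look up z in mf by the x key with Series.map and keep the original z where there is no match (same sample data as above).
import pandas as pd
n = {'x':['a','b','c','d','e'], 'y':['1','2','3','4','5'],'z':['0','0','0','0','0']}
nf = pd.DataFrame(n)
m = {'x':['b','d','e'], 'z':['10','100','1000']}
mf = pd.DataFrame(m)
# Rows of nf whose x has no match in mf become NaN, so fall back to the original z
nf['z'] = nf['x'].map(mf.set_index('x')['z']).fillna(nf['z'])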

Related

Replace contents of cell with another cell if condition on a separate cell is met

I have the following data frame:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
We can see 4 columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D if a condition on C is met; for this example the condition is C = 1.
The intended output is:
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df.drop('D', axis = 1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is another one:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: Just found out a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other. If you pass a ~ to the condition of a .where statement, then you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n

Adding a column to a dataframe after every nth column

I have a dataframe of 9,000 columns and 100 rows. I want to insert a column after every 3rd column such that its value is equal to 50 for all rows.
Existing DataFrame
0 1 2 3 4 5 6 7 8 9....9000
0 a b c d e f g h i j ....x
1 k l m n o p q r s t ....x
.
.
100 u v w x y z aa bb cc....x
Desired DataFrame
0 1 2 3 4 5 6 7 8 9....12000
0 a b c 50 d e f 50 g h i j ....x
1 k l m 50 n o p 50 q r s t ....x
.
.
100 u v w 50 x y z 50 aa bb cc....x
Create a new DataFrame by indexing every 3rd column, add .5 to the column labels for correct sorting, and add it to the original with concat:
df.columns = np.arange(len(df.columns))
df1 = pd.DataFrame(50, index=df.index, columns= df.columns[2::3] + .5)
df2 = pd.concat([df, df1], axis=1).sort_index(axis=1)
df2.columns = np.arange(len(df2.columns))
print (df2)
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Numpy
# How many columns to group
x = 3
# Get the shape of things
a = df.to_numpy()
m, n = a.shape
k = n // x
# Get only a multiple of x columns and reshape
b = a[:, :k * x].reshape(m, k, x)
# Get the other columns missed by b
c = a[:, k * x:]
# array of 50's that we'll append to the last dimension
_50 = np.ones((m, k, 1), np.int64) * 50
# append 50's and reshape back to 2D
d = np.append(b, _50, axis=2).reshape(m, k * (x + 1))
# Create DataFrame while appending the missing bit
pd.DataFrame(np.append(d, c, axis=1))
0 1 2 3 4 5 6 7 8 9 10 11 12
0 a b c 50 d e f 50 g h i 50 j
1 k l m 50 n o p 50 q r s 50 t
Setup
df = pd.DataFrame(np.reshape([*'abcdefghijklmnopqrst'], (2, -1)))
So here is one solution:
s=pd.concat([y.assign(new=50) for x, y in df.groupby(np.arange(df.shape[1])//3,axis=1)],axis=1)
s.columns=np.arange(s.shape[1])
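Note that DataFrame.groupby with axis=1 is deprecated in recent pandas releases, so a rough equivalent sketch that slices the columns in blocks of 3 with iloc (using the same Setup df as above) would be:
s = pd.concat([df.iloc[:, i:i+3].assign(new=50) for i in range(0, df.shape[1], 3)], axis=1)
s.columns = np.arange(s.shape[1])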

Creating a column by addition of two adjacent rows with a condition

Create column E based on column C. If D is < 10 in a row, E should combine the C value of that row with the C value of the neighbouring row (the same combined value appearing in both rows); otherwise E is just C.
This is my input dataset:
I,A,B,C,D
1,P,100+,L,15
2,P,100+,M,9
3,P,100+,N,15
4,P,100+,O,15
5,Q,100+,L,2
6,Q,100+,M,15
7,Q,100+,N,3
8,Q,100+,O,15
I tried using some for loops. However, I think we can use shift or append functions to accomplish this, but I am getting value errors when using the shift function.
Desired Output:
I,A,B,C,D,E
1,P,100+,L,15,L
2,P,100+,M,9,M+N
3,P,100+,N,15,M+N
4,P,100+,O,15,O
5,Q,100+,L,2,L+O
6,Q,100+,M,15,M+N
7,Q,100+,N,3,M+N
8,Q,100+,O,15,L+O
I am trying to work out column E as given in the desired output table above.
Using np.where and shift:
## Where the condition is True, take C from the following row; otherwise keep the current C
df['E'] = np.where(df['D'] < 10, df['C'].shift(-1), df['C'])
## Append the values of C and E
df['E'] = df.apply(lambda x: x.C + '+' + x.E if x.C != x.E else x.C, axis=1)
df['F'] = df['E'].shift(1)
## Copy the combined value down to the following row where the condition was True in the previous row
df['E'] = df.apply(lambda x: x.F if '+' in str(x.F) else x.E, axis=1)
df.drop('F', axis=1, inplace=True)
Output
I A B C D E
0 1 P 100+ L 15 L
1 2 P 100+ M 9 M+N
2 3 P 100+ N 15 M+N
3 4 P 100+ O 15 O
4 5 Q 100+ L 2 L+M
5 6 Q 100+ M 15 L+M
6 7 Q 100+ N 3 N+O
7 8 Q 100+ O 15 N+O
The idea is to create helper groups by replacing index values via the mask with Series.where and forward filling only one missing value, then set the new column with numpy.where, GroupBy.transform and join:
m = df['D'].lt(10)
g = df.index.to_series().where(m).ffill(limit=1)
df['E'] = np.where(g.notna(), df['C'].groupby(g.fillna(-1)).transform('+'.join), df['C'])
print (df)
I A B C D E
0 1 P 100+ L 15 L
1 2 P 100+ M 9 M+N
2 3 P 100+ N 15 M+N
3 4 P 100+ O 15 O
4 5 Q 100+ L 2 L+M
5 6 Q 100+ M 15 L+M
6 7 Q 100+ N 3 N+O
7 8 Q 100+ O 15 N+O
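To make the helper series less opaque, here is roughly what g looks like for the sample data above (an illustrative trace, not additional required code):
m = df['D'].lt(10)                 # [False, True, False, False, True, False, True, False]
g = df.index.to_series().where(m)  # [NaN, 1, NaN, NaN, 4, NaN, 6, NaN]
g = g.ffill(limit=1)               # [NaN, 1, 1, NaN, 4, 4, 6, 6]
# Rows sharing a label get their C values joined ({1: 'M+N', 4: 'L+M', 6: 'N+O'});
# rows where g is NaN keep their own C value.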

pandas apply and applymap functions are taking a long time to run on a large dataset

I have two functions applied on a dataframe
res = df.apply(lambda x:pd.Series(list(x)))
res = res.applymap(lambda x: x.strip('"') if isinstance(x, str) else x)
Update: The dataframe has almost 700,000 rows, so this takes a long time to run.
How can I reduce the running time?
Sample data:
A
----------
0 [1,4,3,c]
1 [t,g,h,j]
2 [d,g,e,w]
3 [f,i,j,h]
4 [m,z,s,e]
5 [q,f,d,s]
output:
A B C D E
-------------------------
0 [1,4,3,c] 1 4 3 c
1 [t,g,h,j] t g h j
2 [d,g,e,w] d g e w
3 [f,i,j,h] f i j h
4 [m,z,s,e] m z s e
5 [q,f,d,s] q f d s
The line res = df.apply(lambda x: pd.Series(list(x))) takes the items from each list and fills them one by one into separate columns, as shown above. There will be almost 38 columns.
I think:
res = df.apply(lambda x:pd.Series(list(x)))
should be changed to:
df1 = pd.DataFrame(df['A'].values.tolist())
print (df1)
0 1 2 3
0 1 4 3 c
1 t g h j
2 d g e w
3 f i j h
4 m z s e
5 q f d s
And for the second step, if the columns do not mix numeric and string values:
cols = df1.select_dtypes(object).columns
df1[cols] = df1[cols].apply(lambda x: x.str.strip('"'))
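Putting the two pieces together, a small end-to-end sketch (the sample lists and the quoted strings here are assumed, just to show the shape of the data):
import pandas as pd
df = pd.DataFrame({'A': [['1', '4', '3', '"c"'], ['t', 'g', '"h"', 'j']]})
# Expand each list into its own column in one vectorised step
df1 = pd.DataFrame(df['A'].values.tolist())
# Strip surrounding quotes only from the string (object) columns
cols = df1.select_dtypes(object).columns
df1[cols] = df1[cols].apply(lambda x: x.str.strip('"'))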

Change the values of column after having used groupby on another column (pandas dataframe)

I have two data frames, one with the coordinates of places
coord = pd.DataFrame()
coord['Index'] = ['A','B','C']
coord['x'] = np.random.random(coord.shape[0])
coord['y'] = np.random.random(coord.shape[0])
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.301380
and one with several values measured in the places
df = pd.DataFrame()
df['Index'] = ['A','A','B','B','B','C','C','C','C']
df['Value'] = np.random.random(df.shape[0])
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
I want to find an efficient way of assigning the coordinates to the df data frame. For the moment I have tried
df['x'] = np.zeros(df.shape[0])
df['y'] = np.zeros(df.shape[0])
for i in df.Index.unique():
df.loc[df.Index == i, 'x'] = coord.loc[coord.Index == i,'x'].values
df.loc[df.Index == i, 'y'] = coord.loc[coord.Index == i,'y'].values
which works and yields
Index Value x y
0 A 0.220323 0.983739 0.121289
1 A 0.115075 0.983739 0.121289
2 B 0.432688 0.809586 0.639811
3 B 0.106178 0.809586 0.639811
4 B 0.259465 0.809586 0.639811
5 C 0.804018 0.827192 0.156095
6 C 0.552053 0.827192 0.156095
7 C 0.412345 0.827192 0.156095
8 C 0.235106 0.827192 0.156095
but this is quite sloppy, and highly inefficient. I tried to use the groupby operation like this
df['x'] =np.zeros(df.shape[0])
df['y'] =np.zeros(df.shape[0])
gb = df.groupby('Index')
for k in gb.groups.keys():
gb.get_group(k)['x'] = coord.loc[coord.Index == k, 'x']
gb.get_group(k)['y'] = coord.loc[coord.Index == k, 'y']
but I get this error here
/anaconda/lib/python2.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I understand the problem, but I don't know how to overcome it.
Any suggestions?
merge is what you're looking for.
df
Index Value
0 A 0.930298
1 A 0.144550
2 B 0.393952
3 B 0.680941
4 B 0.657807
5 C 0.704954
6 C 0.733328
7 C 0.099785
8 C 0.871678
coord
Index x y
0 A 0.888025 0.376416
1 B 0.052976 0.396243
2 C 0.564862 0.301380
df.merge(coord, on='Index')
Index Value x y
0 A 0.930298 0.888025 0.376416
1 A 0.144550 0.888025 0.376416
2 B 0.393952 0.052976 0.396243
3 B 0.680941 0.052976 0.396243
4 B 0.657807 0.052976 0.396243
5 C 0.704954 0.564862 0.301380
6 C 0.733328 0.564862 0.301380
7 C 0.099785 0.564862 0.301380
8 C 0.871678 0.564862 0.301380
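If you only need to pull over a couple of columns, a map-based sketch is another option that gives the same result for this sample (Series.map is standard pandas):
ci = coord.set_index('Index')
df['x'] = df['Index'].map(ci['x'])
df['y'] = df['Index'].map(ci['y'])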
