For the following data frame df1:
sentence   A  B  C  D  F   G
dizzy      1  1  0  0  k   1
Head       0  0  1  0  l   1
nausea     0  0  0  1  fd  1
zap        1  0  1  0  g   1
dizziness  0  0  0  1  V   1
I need to create a dictionary from the column sentence and the columns A, B, C, and D.
In the next step, I need to map the sentences column of data frame df2 to the values of A, B, C, and D. The output should look like this:
sentences  A  B  C  D
dizzy      1  1  0  0
happy
Head       0  0  1  0
nausea     0  0  0  1
fill out
zap        1  0  1  0
dizziness  0  0  0  1
This is my code, but it only handles one column; I do not know how to do it for several columns:
equiv = df1.set_index('sentence')['A'].to_dict()
df2['A'] = df2['sentences'].apply(lambda x: equiv.get(x, np.nan))
Thanks.
IIUC:
Setup:
In [164]: df1
Out[164]:
sentence A B C D F G
0 dizzy 1 1 0 0 k 1
1 Head 0 0 1 0 l 1
2 nausea 0 0 0 1 fd 1
3 zap 1 0 1 0 g 1
4 dizziness 0 0 0 1 V 1
In [165]: df2
Out[165]:
sentences
0 dizzy
1 happy
2 Head
3 nausea
4 fill out
5 zap
6 dizziness
Solution:
In [174]: df2[['sentences']].merge(df1[['sentence','A','B','C','D']],
                                   left_on='sentences',
                                   right_on='sentence',
                                   how='outer')
Out[174]:
sentences sentence A B C D
0 dizzy dizzy 1.0 1.0 0.0 0.0
1 happy NaN NaN NaN NaN NaN
2 Head Head 0.0 0.0 1.0 0.0
3 nausea nausea 0.0 0.0 0.0 1.0
4 fill out NaN NaN NaN NaN NaN
5 zap zap 1.0 0.0 1.0 0.0
6 dizziness dizziness 0.0 0.0 0.0 1.0
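If you'd rather stay with the dictionary approach from the question, it extends to several columns with a plain loop (a sketch using the example data; Series.map does the same lookup as the apply/equiv.get combination, with unmatched sentences becoming NaN):

```python
import pandas as pd

df1 = pd.DataFrame({'sentence': ['dizzy', 'Head', 'nausea', 'zap', 'dizziness'],
                    'A': [1, 0, 0, 1, 0], 'B': [1, 0, 0, 0, 0],
                    'C': [0, 1, 0, 1, 0], 'D': [0, 0, 1, 0, 1]})
df2 = pd.DataFrame({'sentences': ['dizzy', 'happy', 'Head', 'nausea',
                                  'fill out', 'zap', 'dizziness']})

# build one dict per value column, then map each in turn
for col in ['A', 'B', 'C', 'D']:
    equiv = df1.set_index('sentence')[col].to_dict()
    df2[col] = df2['sentences'].map(equiv)  # unmatched sentences -> NaN
```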
For an array, say, a = np.array([1,2,1,0,0,1,1,2,2,2]), something like an adjacency "matrix" A needs to be created. I.e. A is a symmetric (n, n) numpy array where n = len(a) and A[i,j] = 1 if a[i] == a[j] and 0 otherwise (i = 0...n-1 and j = 0...n-1):
(only the upper triangle is shown here; A is symmetric)
   0  1  2  3  4  5  6  7  8  9
0  1  0  1  0  0  1  1  0  0  0
1     1  0  0  0  0  0  1  1  1
2        1  0  0  1  1  0  0  0
3           1  1  0  0  0  0  0
4              1  0  0  0  0  0
5                 1  1  0  0  0
6                    1  0  0  0
7                       1  1  1
8                          1  1
9                             1
The trivial solution is:
n = len(a)
A = np.zeros([n, n]).astype(int)
for i in range(n):
    for j in range(n):
        if a[i] == a[j]:
            A[i, j] = 1
        else:
            A[i, j] = 0
Can this be done in a numpy way, i.e. without loops?
You can use numpy broadcasting:
b = (a[:,None]==a).astype(int)
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 1 1 0 0 0
1 0 1 0 0 0 0 0 1 1 1
2 1 0 1 0 0 1 1 0 0 0
3 0 0 0 1 1 0 0 0 0 0
4 0 0 0 1 1 0 0 0 0 0
5 1 0 1 0 0 1 1 0 0 0
6 1 0 1 0 0 1 1 0 0 0
7 0 1 0 0 0 0 0 1 1 1
8 0 1 0 0 0 0 0 1 1 1
9 0 1 0 0 0 0 0 1 1 1
If you want the upper triangle only, mask the lower triangle with numpy.tril_indices_from:
b = (a[:,None]==a).astype(float)
b[np.tril_indices_from(b, k=-1)] = np.nan
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
1 NaN 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
2 NaN NaN 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
3 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0
4 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 0.0
5 NaN NaN NaN NaN NaN 1.0 1.0 0.0 0.0 0.0
6 NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 0.0
7 NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0
8 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
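The broadcasted comparison can also be spelled with the ufunc's outer method, which some find more readable (same result; purely a matter of taste):

```python
import numpy as np

a = np.array([1, 2, 1, 0, 0, 1, 1, 2, 2, 2])

# outer equality comparison: A[i, j] = (a[i] == a[j])
A = np.equal.outer(a, a).astype(int)
```

By construction the matrix is symmetric with a unit diagonal, since every element equals itself.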
I have a dataframe that looks like this:
A    B    C  D
0    abc  0  cdf
abf  0    0  afg
And I want to replace any string value with 1.
The expected outcome should look like:
A  B  C  D
0  1  0  1
1  0  0  1
Any help on how to do this is appreciated.
The safe way:
df.apply(pd.to_numeric,errors = 'coerce').fillna(1)
Out[217]:
A B C D
0 0.0 1.0 0 1.0
1 1.0 0.0 0 1.0
And for this specific case:
(~df.isin([0,'0'])).astype(int)
Out[221]:
A B C D
0 0 1 0 1
1 1 0 0 1
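If the cells hold mixed Python types, an explicit isinstance check is another option, done per column with Series.map so it works across pandas versions. Note it would treat the string '0' as a string too. A sketch using data shaped like the example:

```python
import pandas as pd

# sample data shaped like the question's frame
df = pd.DataFrame({'A': [0, 'abf'], 'B': ['abc', 0],
                   'C': [0, 0], 'D': ['cdf', 'afg']})

# replace any string cell with 1, leave numbers untouched
out = df.apply(lambda col: col.map(lambda x: 1 if isinstance(x, str) else x))
```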
I am new to pandas and facing an issue with null values. I have a list of 3 values that has to be inserted into a column where values are missing. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want:
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list has the same length as the number of NaNs:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
Try stack (keeping the NaNs), assign the values, then unstack back:
s = df.stack(dropna=False)
# the question's list was renamed to l: naming it `list` shadows the
# Python built-in, which will cause confusing bugs later
s.loc[s.isna()] = l
df = s.unstack()
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0
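A third way, sketched under the assumption that only column b has NaNs: build a Series aligned to the missing positions and let fillna do the assignment by index alignment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 0, 0, 0],
                   'b': [1, np.nan, np.nan, 1, np.nan],
                   'c': [2, 0, 3, 2, 2],
                   'd': [3, 1, 4, 5, 6]})
l = [11, 22, 44]

# a Series whose index is exactly the NaN positions of column b
fill = pd.Series(l, index=df.index[df['b'].isna()])
df['b'] = df['b'].fillna(fill)
```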
I want to take the row-wise sum of columns that start with the same text string. Below is my original df with the fails per course.
Original df:
ID  P_English_2  P_English_3  P_German_1  P_Math_1  P_Math_3  P_Physics_2  P_Physics_4
56  1            3            1           2         0         0            3
11  0            0            0           1         4         1            0
6   0            0            0           0         0         1            0
43  1            2            1           0         0         1            1
14  0            1            0           0         1         0            0
Desired df:
ID  P_English  P_German  P_Math  P_Physics
56  4          1         2       3
11  0          0         5       1
6   0          0         0       1
43  3          1         0       2
14  1          0         1       0
Tried code:
import pandas as pd
df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})
print(df)
categories = ['P_Math', 'P_English', 'P_Physics', 'P_German']
def correct_categories(cols):
    return [cat for col in cols for cat in categories if col.startswith(cat)]

result = df.groupby(correct_categories(df.columns), axis=1).sum()
print(result)
Let's try groupby with axis=1:
# extract the subjects
subjects = [x[0] for x in df.columns.str.rsplit('_',n=1)]
df.groupby(subjects, axis=1).sum()
Output:
ID P_English P_German P_Math P_Physics
0 56 4 1 2 3
1 11 0 0 5 1
2 6 0 0 0 1
3 43 3 1 0 2
4 14 1 0 1 0
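Note that DataFrame.groupby(..., axis=1) is deprecated in recent pandas releases; the same grouping can be written against the transpose instead (a sketch based on the question's data):

```python
import pandas as pd

df = pd.DataFrame({"ID": [56, 11, 6, 43, 14],
                   "P_Math_1": [2, 1, 0, 0, 0],
                   "P_English_3": [3, 0, 0, 2, 1],
                   "P_English_2": [1, 0, 0, 1, 0],
                   "P_Math_3": [0, 4, 0, 0, 1],
                   "P_Physics_2": [0, 1, 1, 1, 0],
                   "P_Physics_4": [3, 0, 0, 1, 0],
                   "P_German_1": [1, 0, 0, 1, 0]})

# strip the trailing _<number> so columns of one subject share a label
# ('ID' has no underscore and stays as-is)
subjects = [c.rsplit('_', 1)[0] for c in df.columns]

# group the transposed frame by subject, sum, then transpose back
result = df.T.groupby(subjects).sum().T
```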
Or you can use wide_to_long, assuming the ID values are unique:
(pd.wide_to_long(df, stubnames=categories,
i=['ID'], j='count', sep='_')
.groupby('ID').sum()
)
Output:
P_Math P_English P_Physics P_German
ID
56 2.0 4.0 3.0 1.0
11 5.0 0.0 1.0 0.0
6 0.0 0.0 1.0 0.0
43 0.0 3.0 2.0 1.0
14 1.0 1.0 0.0 0.0
What is the best way to add the contents of two dataframes, which have mostly equivalent indices:
df1:
   A  B  C
A  0  3  1
B  3  0  2
C  1  2  0

df2:
   A  B  C  D
A  0  1  1  0
B  1  0  3  2
C  1  3  0  0
D  0  2  0  0

df1 + df2 =
   A  B  C  D
A  0  4  2  0
B  4  0  5  2
C  2  5  0  0
D  0  2  0  0
You can also concat the two dataframes, since concatenation (by default) aligns on the index.
# sample dataframe
df1 = pd.DataFrame({'a': [1,2,3], 'b':[2,3,4]}, index=['a','c','e'])
df2 = pd.DataFrame({'a': [10,20], 'b':[11,22]}, index=['b','d'])
new_df = pd.concat([df1, df2]).sort_index()
print(new_df)
a b
a 1 2
b 10 11
c 2 3
d 20 22
e 3 4
I think you can just add:
In [625]: df1.add(df2,fill_value=0)
Out[625]:
A B C D
A 0.0 4.0 2.0 0.0
B 4.0 0.0 5.0 2.0
C 2.0 5.0 0.0 0.0
D 0.0 2.0 0.0 0.0
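Alignment introduces NaN before the fill, so add with fill_value upcasts the result to float. If integer output is wanted, casting back is safe here because every cell exists in at least one of the frames (a sketch with the example data):

```python
import pandas as pd

df1 = pd.DataFrame([[0, 3, 1], [3, 0, 2], [1, 2, 0]],
                   index=list('ABC'), columns=list('ABC'))
df2 = pd.DataFrame([[0, 1, 1, 0], [1, 0, 3, 2], [1, 3, 0, 0], [0, 2, 0, 0]],
                   index=list('ABCD'), columns=list('ABCD'))

# fill_value=0 supplies a 0 wherever only one frame has the cell;
# cast back to int since no NaN remains after the addition
result = df1.add(df2, fill_value=0).astype(int)
```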