Here I have an example DataFrame:
A={'a_1':[1,2,3,4,5],'a_2':[6,7,8,9,4],'a_3':[0,6,2,4,7],'a_4':[3,5,2,4,6],
'b_1':[1,2,6,4,3],'b_2':[6,7,3,2,4],'b_3':[0,7,2,4,7],'b_4':[3,3,2,4,8]
}
data=pd.DataFrame.from_dict(A)
output:
a_1 a_2 a_3 a_4 b_1 b_2 b_3 b_4
1 6 0 3 1 6 0 3
2 7 6 5 2 7 7 3
3 8 2 2 6 3 2 2
4 9 4 4 4 2 4 4
5 4 7 6 3 4 7 8
What I want to do is take the difference between the columns starting with a and the corresponding columns starting with b, floored at 0, something like
max(data[a_] - data[b_], 0)
Does anyone know how I could apply such a function to the DataFrame?
What I have tried is something like
def test_(row, column_1, column_2):
    result = max(row[column_1].any() - row[column_2].any(), 0)

data['result'] = np.nan
for i in range(1, 5):
    data['result'] = data.apply(test_(data, 'a' + str(i), 'b' + str(i)))
This won't work.
You can use numpy's maximum, which operates on whole columns at once. Then just iterate over the numbered column pairs and append a new result column to your dataframe:
import numpy as np
for i in range(1, 5):
    data['result_' + str(i)] = np.maximum(data['a_' + str(i)] - data['b_' + str(i)], 0)
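If you'd rather avoid the loop, a vectorized sketch (assuming the a_* and b_* columns appear in matching order, as in the example) is:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'a_1': [1, 2, 3, 4, 5], 'a_2': [6, 7, 8, 9, 4],
                     'a_3': [0, 6, 2, 4, 7], 'a_4': [3, 5, 2, 4, 6],
                     'b_1': [1, 2, 6, 4, 3], 'b_2': [6, 7, 3, 2, 4],
                     'b_3': [0, 7, 2, 4, 7], 'b_4': [3, 3, 2, 4, 8]})

# Select the two column blocks; filter(like=...) keeps frame order
a = data.filter(like='a_').to_numpy()
b = data.filter(like='b_').to_numpy()

# Element-wise difference, floored at zero, done in one shot
result = pd.DataFrame(np.maximum(a - b, 0),
                      columns=['result_' + str(i) for i in range(1, 5)])
```

This relies on the a-block and b-block having the same column order, so it is a sketch rather than a drop-in replacement if your real columns are interleaved differently.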
You can groupby columns then using diff
df=data.groupby(data.columns.str.split('_').str[1].values,axis=1).diff().dropna(1)
df
Out[347]:
b_1 b_2 b_3 b_4
0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.0 -2.0
2 3.0 -5.0 0.0 0.0
3 0.0 -7.0 0.0 0.0
4 -2.0 0.0 0.0 2.0
Then we using mask
df.mask(df<0,0)
Out[349]:
b_1 b_2 b_3 b_4
0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.0 0.0
2 3.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 2.0
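As a side note, clip(lower=0) is an equivalent way to zero out the negatives; a minimal sketch on a small frame with purely illustrative values:

```python
import pandas as pd

# Small frame with mixed-sign differences (values chosen for illustration only)
df = pd.DataFrame({'d_1': [0.0, -1.0, 2.0], 'd_2': [3.0, 0.0, -2.0]})

# clip(lower=0) replaces every negative entry with 0,
# equivalent to df.mask(df < 0, 0)
clipped = df.clip(lower=0)
```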
A={'a_1':[1,2,3,4,5],'a_2':[6,7,8,9,4],'a_3':[0,6,2,4,7],'a_4':[3,5,2,4,6],
'b_1':[1,2,6,4,3],'b_2':[6,7,3,2,4],'b_3':[0,7,2,4,7],'b_4':[3,3,2,4,8]
}
data=pd.DataFrame.from_dict(A)
x = data.iloc[:,0:4].values - data.iloc[:,4:].values
print(x)
x = pd.DataFrame(x)
print(x)
output:
[[ 0 0 0 0]
[ 0 0 -1 2]
[-3 5 0 0]
[ 0 7 0 0]
[ 2 0 0 -2]]
0 1 2 3
0 0 0 0 0
1 0 0 -1 2
2 -3 5 0 0
3 0 7 0 0
4 2 0 0 -2
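This gives the raw differences only; to match the max(..., 0) part of the question, the negatives can then be floored at zero, e.g.:

```python
import numpy as np
import pandas as pd

A = {'a_1': [1, 2, 3, 4, 5], 'a_2': [6, 7, 8, 9, 4], 'a_3': [0, 6, 2, 4, 7],
     'a_4': [3, 5, 2, 4, 6], 'b_1': [1, 2, 6, 4, 3], 'b_2': [6, 7, 3, 2, 4],
     'b_3': [0, 7, 2, 4, 7], 'b_4': [3, 3, 2, 4, 8]}
data = pd.DataFrame.from_dict(A)

# Difference of the a-block and b-block, then floor negatives at zero
x = pd.DataFrame(np.maximum(data.iloc[:, 0:4].values - data.iloc[:, 4:].values, 0))
```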
Related
I want to group by the id column in this dataframe:
id a b c
0 1 1 6 2
1 1 2 5 2
2 2 3 4 2
3 2 4 3 2
4 3 5 2 2
5 3 6 1 2
and add the differences between rows for the same column and group as additional columns to end up with this dataframe:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
The data:
df = pd.DataFrame({'id': [1,1,2,2,3,3], 'a': [1,2,3,4,5,6],'b': [6,5,4,3,2,1], 'c': [2,2,2,2,2,2]})
Your desired output doesn't make much sense, but I can force it there with:
df[['a_diff', 'b_diff', 'c_diff']] = df.groupby('id').transform(lambda x: x.diff(1).fillna(x.diff(-1)))
Output:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
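To see why the lambda works here: for these two-row groups, x.diff(1) leaves NaN in a group's first row and x.diff(-1) leaves NaN in its second, so chaining them through fillna fills every row. A minimal sketch on one group:

```python
import pandas as pd

s = pd.Series([1, 2])                 # one two-row group from column 'a'
forward = s.diff(1)                   # NaN in the first row, then s[1] - s[0]
backward = s.diff(-1)                 # s[0] - s[1], then NaN in the last row
combined = forward.fillna(backward)   # every row filled
```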
For an array, say, a = np.array([1,2,1,0,0,1,1,2,2,2]), something like an adjacency "matrix" A needs to be created. I.e. A is a symmetric (n, n) numpy array where n = len(a) and A[i,j] = 1 if a[i] == a[j] and 0 otherwise (i = 0...n-1 and j = 0...n-1):
   0  1  2  3  4  5  6  7  8  9
0  1  0  1  0  0  1  1  0  0  0
1     1  0  0  0  0  0  1  1  1
2        1  0  0  1  1  0  0  0
3           1  1  0  0  0  0  0
4              1  0  0  0  0  0
5                 1  1  0  0  0
6                    1  0  0  0
7                       1  1  1
8                          1  1
9                             1
The trivial solution is
n = len(a)
A = np.zeros([n, n]).astype(int)
for i in range(n):
    for j in range(n):
        if a[i] == a[j]:
            A[i, j] = 1
        else:
            A[i, j] = 0
Can this be done in a numpy way, i.e. without loops?
You can use numpy broadcasting:
b = (a[:,None]==a).astype(int)
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1 0 1 0 0 1 1 0 0 0
1 0 1 0 0 0 0 0 1 1 1
2 1 0 1 0 0 1 1 0 0 0
3 0 0 0 1 1 0 0 0 0 0
4 0 0 0 1 1 0 0 0 0 0
5 1 0 1 0 0 1 1 0 0 0
6 1 0 1 0 0 1 1 0 0 0
7 0 1 0 0 0 0 0 1 1 1
8 0 1 0 0 0 0 0 1 1 1
9 0 1 0 0 0 0 0 1 1 1
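As a quick sanity check (a sketch, not part of the original answer), the broadcast result agrees with the double loop from the question:

```python
import numpy as np

a = np.array([1, 2, 1, 0, 0, 1, 1, 2, 2, 2])

# Broadcasting: (n, 1) column compared against (n,) row gives an (n, n) matrix
b = (a[:, None] == a).astype(int)

# Reference implementation with explicit loops
n = len(a)
A = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(n):
        A[i, j] = 1 if a[i] == a[j] else 0

print(np.array_equal(A, b))  # prints True
```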
If you want the upper triangle only, use numpy.tril_indices_from to blank out the lower triangle:
b = (a[:,None]==a).astype(float)
b[np.tril_indices_from(b, k=-1)] = np.nan
df = pd.DataFrame(b)
output:
0 1 2 3 4 5 6 7 8 9
0 1.0 0.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
1 NaN 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
2 NaN NaN 1.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0
3 NaN NaN NaN 1.0 1.0 0.0 0.0 0.0 0.0 0.0
4 NaN NaN NaN NaN 1.0 0.0 0.0 0.0 0.0 0.0
5 NaN NaN NaN NaN NaN 1.0 1.0 0.0 0.0 0.0
6 NaN NaN NaN NaN NaN NaN 1.0 0.0 0.0 0.0
7 NaN NaN NaN NaN NaN NaN NaN 1.0 1.0 1.0
8 NaN NaN NaN NaN NaN NaN NaN NaN 1.0 1.0
9 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0
I have a dataframe of 5252 rows x 3 columns.
The data look something like this:
X Y Z
1 1 2
1 2 4
1 3 3.5
2 13 4
1 4 3
2 14 3.5
3 14 2
3 15 1
4 16 .5
4 18 2
. . .
. . .
. . .
1508 751 1
1508 669 1
1508 686 2.5
I want to convert it so that X (the user id) gives the rows and Y (the item id) gives the columns, with Z as the value corresponding to each X, Y pair. Something like this:
1 2 3 4 5 6 13 14 15 16 17 18 669 686
1 2 4 3.5 3 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 4 3.5 0 0 0 0 0 0
3 0 0 0 0 0 0 0 2 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 .5 0 2 0 0
.
.
.
1508 0 0 0 0 0 0 0 0 0 0 0 0 1 1
I assume you're using the pandas library.
You need the pd.pivot_table function. If the dataframe is called df (note the column names are uppercase), then you need:
pd.pivot_table(data=df, index="X", columns="Y", values="Z", aggfunc="sum")
You need to use pd.pivot_table() and use fillna(0). Recreating your sample dataframe:
import pandas as pd
df = pd.DataFrame({'X': [1,1,1,1,2,2,3,3,4], 'Y': [1,2,3,4,13,14,14,15,16], 'Z': [2,4,3.5,3,4,3.5,2,1,.5]})
Gives:
X Y Z
0 1 1 2.0
1 1 2 4.0
2 1 3 3.5
3 1 4 3.0
4 2 13 4.0
5 2 14 3.5
6 3 14 2.0
7 3 15 1.0
8 4 16 0.5
Then using pd.pivot_table():
pd.pivot_table(df, values='Z', index=['X'], columns=['Y']).fillna(0)
Yields:
Y 1 2 3 4 13 14 15 16
X
1 2.0 4.0 3.5 3.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 4.0 3.5 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5
I have a dataframe like this:
df_ex_A = pd.DataFrame({'X':['r','r','t','t','v','w'],
'A':[3,4,1,2,1,1],
'A_val':[25,25,100,20,10,90]})
Out[115]:
X A A_val
0 r 3 25
1 r 4 25
2 t 1 100
3 t 2 20
4 v 1 10
5 w 1 90
and another df like this:
df_ex_B = pd.DataFrame({ 'X':['r','r','t','t','v','w'],
'B':[4,5,2,3,2,2],
'B_val':[75,65,30,0,0,0]})
Out[117]:
X B B_val
0 r 4 75
1 r 5 65
2 t 2 30
3 t 3 0
4 v 2 0
5 w 2 0
I want to create a df by merging on equal values of A and B, like this:
X (A==B) A_val B_val
0 r 3 25 0
1 r 4 25 75
2 r 5 0 65
3 t 1 100 0
4 t 2 20 30
5 t 3 0 0
6 v 1 10 0
7 v 2 0 0
8 w 1 90 0
9 w 2 0 0
How can I execute a merge to get this df?
Thanks
Let's try using set_index and pd.concat:
dfA = df_ex_A.set_index(['X','A']).rename_axis(['X','A==B'])
dfB = df_ex_B.set_index(['X','B']).rename_axis(['X','A==B'])
pd.concat([dfA,dfB], axis=1).fillna(0).reset_index()
Output:
X A==B A_val B_val
0 r 3 25.0 0.0
1 r 4 25.0 75.0
2 r 5 0.0 65.0
3 t 1 100.0 0.0
4 t 2 20.0 30.0
5 t 3 0.0 0.0
6 v 1 10.0 0.0
7 v 2 0.0 0.0
8 w 1 90.0 0.0
9 w 2 0.0 0.0
Or you can use join after setting indexes and renaming axis:
dfA.join(dfB, how='outer').fillna(0).reset_index()
Output:
X A==B A_val B_val
0 r 3 25.0 0.0
1 r 4 25.0 75.0
2 r 5 0.0 65.0
3 t 1 100.0 0.0
4 t 2 20.0 30.0
5 t 3 0.0 0.0
6 v 1 10.0 0.0
7 v 2 0.0 0.0
8 w 1 90.0 0.0
9 w 2 0.0 0.0
I think what you want is an outer join which can be done with merge specifying how='outer':
df_ex_A.merge(df_ex_B.rename(columns={'B':'A'}), how='outer').fillna(0).rename(columns={'A':'A==B'})
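Note the outer merge emits the matched rows first and the unmatched B-only rows at the end; a sketch that also sorts into the requested row order (the sort_values step is my addition, not part of the one-liner above):

```python
import pandas as pd

df_ex_A = pd.DataFrame({'X': ['r', 'r', 't', 't', 'v', 'w'],
                        'A': [3, 4, 1, 2, 1, 1],
                        'A_val': [25, 25, 100, 20, 10, 90]})
df_ex_B = pd.DataFrame({'X': ['r', 'r', 't', 't', 'v', 'w'],
                        'B': [4, 5, 2, 3, 2, 2],
                        'B_val': [75, 65, 30, 0, 0, 0]})

# Outer merge on the shared columns (X and the renamed key), fill the gaps,
# then sort into X / key order to match the requested layout
out = (df_ex_A.merge(df_ex_B.rename(columns={'B': 'A'}), how='outer')
              .fillna(0)
              .rename(columns={'A': 'A==B'})
              .sort_values(['X', 'A==B'])
              .reset_index(drop=True))
```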
I have the following DataFrames:
A =
0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
B =
0 5
0 1 1
5 1 1
I want to 'join' these two frames such that:
A + B =
0 1 2 5
0 2 1 1 1
1 1 1 1 0
2 1 1 1 0
5 1 0 0 1
where A+B is a new dataframe
Using add with fill_value=0:
df1.add(df2,fill_value=0).fillna(0)
Out[217]:
0 1 2 5
0 2.0 1.0 1.0 1.0
1 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 0.0
5 1.0 0.0 0.0 1.0
If you need integer output:
df1.add(df2,fill_value=0).fillna(0).astype(int)
Out[242]:
0 1 2 5
0 2 1 1 1
1 1 1 1 0
2 1 1 1 0
5 1 0 0 1
import numpy as np
import pandas as pd
A = pd.DataFrame(np.ones(9).reshape(3, 3))
B = pd.DataFrame(np.ones(4).reshape(2, 2), columns=[0, 5], index=[0, 5])
A.add(B, fill_value=0).fillna(0)
[Out]
0 1 2 5
0 2.0 1.0 1.0 1.0
1 1.0 1.0 1.0 0.0
2 1.0 1.0 1.0 0.0
5 1.0 0.0 0.0 1.0