I have a dataframe like this:
df_ex_A = pd.DataFrame({'X':['r','r','t','t','v','w'],
'A':[3,4,1,2,1,1],
'A_val':[25,25,100,20,10,90]})
Out[115]:
X A A_val
0 r 3 25
1 r 4 25
2 t 1 100
3 t 2 20
4 v 1 10
5 w 1 90
and another df like this:
df_ex_B = pd.DataFrame({ 'X':['r','r','t','t','v','w'],
'B':[4,5,2,3,2,2],
'B_val':[75,65,30,0,0,0]})
Out[117]:
X B B_val
0 r 4 75
1 r 5 65
2 t 2 30
3 t 3 0
4 v 2 0
5 w 2 0
I want to create a df by merging on equal values of A and B, like this:
X (A==B) A_val B_val
0 r 3 25 0
1 r 4 25 75
2 r 5 0 65
3 t 1 100 0
4 t 2 20 30
5 t 3 0 0
6 v 1 10 0
7 v 2 0 0
8 w 1 90 0
9 w 2 0 0
How do I execute a merge to get this df?
Thanks
Let's try using set_index and pd.concat:
# put (X, key) into the index of both frames, under a shared level name,
# so that concat can align them on the index
dfA = df_ex_A.set_index(['X','A']).rename_axis(['X','A==B'])
dfB = df_ex_B.set_index(['X','B']).rename_axis(['X','A==B'])
pd.concat([dfA, dfB], axis=1).fillna(0).reset_index()
Output:
X A==B A_val B_val
0 r 3 25.0 0.0
1 r 4 25.0 75.0
2 r 5 0.0 65.0
3 t 1 100.0 0.0
4 t 2 20.0 30.0
5 t 3 0.0 0.0
6 v 1 10.0 0.0
7 v 2 0.0 0.0
8 w 1 90.0 0.0
9 w 2 0.0 0.0
Or you can use join after setting the indexes and renaming the axis:
dfA.join(dfB, how='outer').fillna(0).reset_index()
Output:
X A==B A_val B_val
0 r 3 25.0 0.0
1 r 4 25.0 75.0
2 r 5 0.0 65.0
3 t 1 100.0 0.0
4 t 2 20.0 30.0
5 t 3 0.0 0.0
6 v 1 10.0 0.0
7 v 2 0.0 0.0
8 w 1 90.0 0.0
9 w 2 0.0 0.0
I think what you want is an outer join, which can be done with merge by specifying how='outer':
df_ex_A.merge(df_ex_B.rename(columns={'B':'A'}), how='outer').fillna(0).rename(columns={'A':'A==B'})
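If the row order matters, a final sort on the keys should line the result up with your desired output (a sketch; same chain as above):
(df_ex_A.merge(df_ex_B.rename(columns={'B': 'A'}), how='outer')
        .fillna(0)
        .rename(columns={'A': 'A==B'})
        .sort_values(['X', 'A==B'])      # order rows as in the desired output
        .reset_index(drop=True))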
I want to group by the id column in this dataframe:
id a b c
0 1 1 6 2
1 1 2 5 2
2 2 3 4 2
3 2 4 3 2
4 3 5 2 2
5 3 6 1 2
and add the differences between rows for the same column and group as additional columns to end up with this dataframe:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
Data to reproduce the dataframe:
df = pd.DataFrame({'id': [1,1,2,2,3,3], 'a': [1,2,3,4,5,6],'b': [6,5,4,3,2,1], 'c': [2,2,2,2,2,2]})
Your desired output doesn't make much sense, but I can force it there with:
# within each id group: forward difference, with the first row of each
# group filled by the backward difference
df[['a_diff', 'b_diff', 'c_diff']] = df.groupby('id').transform(lambda x: x.diff(1).fillna(x.diff(-1)))
Output:
id a b c a_diff b_diff c_diff
0 1 1 6 2 -1.0 1.0 0.0
1 1 2 5 2 1.0 -1.0 0.0
2 2 3 4 2 -1.0 1.0 0.0
3 2 4 3 2 1.0 -1.0 0.0
4 3 5 2 2 -1.0 1.0 0.0
5 3 6 1 2 1.0 -1.0 0.0
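To see why the lambda fills the first row of each group, run it on one two-row group by hand (a sketch):
s = pd.Series([1, 2])  # column 'a' for the id == 1 group
fwd = s.diff(1)        # [NaN, 1.0]: each row minus the previous row
bwd = s.diff(-1)       # [-1.0, NaN]: each row minus the next row
fwd.fillna(bwd)        # [-1.0, 1.0], matching a_diff above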
I want to find the cumulative count before there is a change in value, i.e. how many rows it has been since the last change. For illustration:
Value  diff  #row since last change (how do I create this column?)
6      na    na
5      -1    0
5      0     1
5      0     2
4      -1    0
4      0     1
4      0     2
4      0     3
4      0     4
5      1     0
5      0     1
5      0     2
5      0     3
6      1     0
7      1     0
I tried to use cumsum, but it does not reset after each change.
IIUC, use a cumcount per group:
df['new'] = df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
output:
Value diff new
0 6 na 0
1 5 -1 0
2 5 0 1
3 5 0 2
4 4 -1 0
5 4 0 1
6 4 0 2
7 4 0 3
8 4 0 4
9 5 1 0
10 5 0 1
11 5 0 2
12 5 0 3
13 6 1 0
14 7 1 0
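The grouper builds one label per run of equal values, so cumcount restarts at every change; the intermediate steps look like this (a sketch on the same df):
change = df['Value'].ne(df['Value'].shift())  # True on row 0 and wherever Value changes
block = change.cumsum()                       # 1, 2, 2, 2, 3, ... one label per run
df['new'] = df.groupby(block).cumcount()      # 0-based position within each run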
If you want the NaN based on diff, you can mask the output:
df['new'] = (df.groupby(df['Value'].ne(df['Value'].shift()).cumsum()).cumcount()
.mask(df['diff'].isna())
)
output:
Value diff new
0 6 NaN NaN
1 5 -1.0 0.0
2 5 0.0 1.0
3 5 0.0 2.0
4 4 -1.0 0.0
5 4 0.0 1.0
6 4 0.0 2.0
7 4 0.0 3.0
8 4 0.0 4.0
9 5 1.0 0.0
10 5 0.0 1.0
11 5 0.0 2.0
12 5 0.0 3.0
13 6 1.0 0.0
14 7 1.0 0.0
If performance is important, count consecutive 0 values from the difference column:
m = df['diff'].eq(0)   # True while the value is unchanged
b = m.cumsum()         # running total of unchanged rows
# subtract the running total frozen at the last change, so the count restarts
df['out'] = b.sub(b.mask(m).ffill().fillna(0)).astype(int)
print(df)
Value diff need out
0 6 NaN na 0
1 5 -1.0 0 0
2 5 0.0 1 1
3 5 0.0 2 2
4 4 -1.0 0 0
5 4 0.0 1 1
6 4 0.0 2 2
7 4 0.0 3 3
8 4 0.0 4 4
9 5 1.0 0 0
10 5 0.0 1 1
11 5 0.0 2 2
12 5 0.0 3 3
13 6 1.0 0 0
14 7 1.0 0 0
I am new to pandas and facing an issue with null values. I have a list of 3 values that has to be inserted into a column in place of the missing values. How do I do that?
In [57]: df
Out[57]:
a b c d
0 0 1 2 3
1 0 NaN 0 1
2 0 NaN 3 4
3 0 1 2 5
4 0 NaN 2 6
In [58]: list = [11,22,44]
The output I want
Out[57]:
a b c d
0 0 1 2 3
1 0 11 0 1
2 0 22 3 4
3 0 1 2 5
4 0 44 2 6
If your list is the same length as the number of NaNs:
l=[11,22,44]
df.loc[df['b'].isna(),'b'] = l
print(df)
a b c d
0 0 1.0 2 3
1 0 11.0 0 1
2 0 22.0 3 4
3 0 1.0 2 5
4 0 44.0 2 6
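A defensive variant that checks that assumption before assigning (a sketch; l as above):
mask = df['b'].isna()
assert mask.sum() == len(l), 'need exactly one replacement value per NaN'
df.loc[mask, 'b'] = l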
Try with stack and assign the values, then unstack back:
s = df.stack(dropna=False)   # flatten to a Series, keeping the NaNs
s.loc[s.isna()] = l          # list renamed to l: shadowing the built-in name `list` is asking for trouble
df = s.unstack()
df
df
Out[178]:
a b c d
0 0.0 1.0 2.0 3.0
1 0.0 11.0 0.0 1.0
2 0.0 22.0 3.0 4.0
3 0.0 1.0 2.0 5.0
4 0.0 44.0 2.0 6.0
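Note that the stack/unstack round trip pushes every column through a single Series, so all columns come back as float, as visible above.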
I have a dataframe of 5252 rows x 3 columns.
The data look something like this:
X Y Z
1 1 2
1 2 4
1 3 3.5
2 13 4
1 4 3
2 14 3.5
3 14 2
3 15 1
4 16 .5
4 18 2
. . .
. . .
. . .
1508 751 1
1508 669 1
1508 686 2.5
I want to convert it so that X (the user id) becomes the rows and Y (the item id) becomes the columns, with Z as the value corresponding to each (X, Y) pair. Something like this:
1 2 3 4 5 6 13 14 15 16 17 18 669 686
1 2 4 3.5 3 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 4 3.5 0 0 0 0 0 0
3 0 0 0 0 0 0 0 2 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 .5 0 2 0 0
.
.
.
1508 0 0 0 0 0 0 0 0 0 0 0 0 1 1
I assume you're using the pandas library.
You need the pd.pivot_table function, plus fillna(0) for the missing combinations. If the dataframe is called df, then you need:
pd.pivot_table(data=df, index="X", columns="Y", values="Z", aggfunc=sum).fillna(0)
You need pd.pivot_table() together with fillna(0). Recreating your sample dataframe:
import pandas as pd
df = pd.DataFrame({'X': [1,1,1,1,2,2,3,3,4], 'Y': [1,2,3,4,13,14,14,15,16], 'Z': [2,4,3.5,3,4,3.5,2,1,.5]})
Gives:
X Y Z
0 1 1 2.0
1 1 2 4.0
2 1 3 3.5
3 1 4 3.0
4 2 13 4.0
5 2 14 3.5
6 3 14 2.0
7 3 15 1.0
8 4 16 0.5
Then using pd.pivot_table():
pd.pivot_table(df, values='Z', index=['X'], columns=['Y']).fillna(0)
Yields:
Y 1 2 3 4 13 14 15 16
X
1 2.0 4.0 3.5 3.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 4.0 3.5 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 2.0 1.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5
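Since each (X, Y) pair occurs at most once in the sample, plain pivot (no aggregation) should work too, as a sketch:
df.pivot(index='X', columns='Y', values='Z').fillna(0)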
Here I have an example DataFrame:
A={'a_1':[1,2,3,4,5],'a_2':[6,7,8,9,4],'a_3':[0,6,2,4,7],'a_4':[3,5,2,4,6],
'b_1':[1,2,6,4,3],'b_2':[6,7,3,2,4],'b_3':[0,7,2,4,7],'b_4':[3,3,2,4,8]
}
data=pd.DataFrame.from_dict(A)
output:
   a_1  a_2  a_3  a_4  b_1  b_2  b_3  b_4
0    1    6    0    3    1    6    0    3
1    2    7    6    5    2    7    7    3
2    3    8    2    2    6    3    2    2
3    4    9    4    4    4    2    4    4
4    5    4    7    6    3    4    7    8
What I want to do is take the difference between each column starting with a and the corresponding column starting with b, floored at 0, like
max(data[a_i] - data[b_i], 0)
Does anyone know how I could apply such a function to the dataframe?
What I have tried is something like:
def test_(row, column_1, column_2):
    result = max(row[column_1].any() - row[column_2].any(), 0)

data['result'] = np.nan
for i in range(1, 5):
    data['result'] = data.apply(test_(data, 'a' + str(i), 'b' + str(i)))
This won't work.
You can use numpy's maximum, which applies to the whole column. Then just iterate over the numbered columns and append a new column to your dataframe:
import numpy as np

for i in range(1, 5):
    data['result_' + str(i)] = np.maximum(data['a_' + str(i)] - data['b_' + str(i)], 0)
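If you'd rather avoid the Python loop, the same thing can be done in one shot on the underlying arrays (a sketch, assuming the a_/b_ column layout above):
a_cols = ['a_' + str(i) for i in range(1, 5)]
b_cols = ['b_' + str(i) for i in range(1, 5)]
# subtract the b block from the a block and floor at 0, all at once
data[['result_' + str(i) for i in range(1, 5)]] = np.maximum(
    data[a_cols].values - data[b_cols].values, 0)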
You can group the columns by their numeric suffix, then use diff:
df = data.groupby(data.columns.str.split('_').str[1].values, axis=1).diff().dropna(axis=1)
df
Out[347]:
b_1 b_2 b_3 b_4
0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.0 -2.0
2 3.0 -5.0 0.0 0.0
3 0.0 -7.0 0.0 0.0
4 -2.0 0.0 0.0 2.0
Then we use mask to replace the negative values with 0:
df.mask(df<0,0)
Out[349]:
b_1 b_2 b_3 b_4
0 0.0 0.0 0.0 0.0
1 0.0 0.0 1.0 0.0
2 3.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 2.0
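One caveat: within each pair, diff subtracts the first column from the second, so the values above are b_i - a_i. If you want max(a_i - b_i, 0) exactly as written in the question, negate the diff first (a sketch with the same grouping):
d = -data.groupby(data.columns.str.split('_').str[1].values, axis=1).diff().dropna(axis=1)
d.mask(d < 0, 0)  # elementwise max(a_i - b_i, 0); columns keep the b_* labels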
A={'a_1':[1,2,3,4,5],'a_2':[6,7,8,9,4],'a_3':[0,6,2,4,7],'a_4':[3,5,2,4,6],
'b_1':[1,2,6,4,3],'b_2':[6,7,3,2,4],'b_3':[0,7,2,4,7],'b_4':[3,3,2,4,8]
}
data=pd.DataFrame.from_dict(A)
# positional block subtraction: columns 0-3 are a_1..a_4, columns 4-7 are b_1..b_4
x = data.iloc[:, 0:4].values - data.iloc[:, 4:].values
print(x)
x = pd.DataFrame(x)   # back to a dataframe (the column labels are lost)
print(x)
output:
[[ 0 0 0 0]
[ 0 0 -1 2]
[-3 5 0 0]
[ 0 7 0 0]
[ 2 0 0 -2]]
0 1 2 3
0 0 0 0 0
1 0 0 -1 2
2 -3 5 0 0
3 0 7 0 0
4 2 0 0 -2
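This gives the raw a - b differences but stops short of the comparison with 0; a final clip completes it (a sketch):
x = x.clip(lower=0)   # elementwise max(a_i - b_i, 0)
print(x)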