I have two dataframes. (a, b, c, d) and (i, j, k) are the column names of the dataframes:
df1 =
a b c d
0 1 2 3
0 1 2 3
0 1 2 3
df2 =
i j k
0 1 2
0 1 2
0 1 2
I want to select the entries of df1 whose values also appear in df2.
I want to obtain:
df1=
a b c
0 1 2
0 1 2
0 1 2
You can use isin to compare df1 with each column of df2:
dfs = []
for i in range(len(df2.columns)):
    df = df1.isin(df2.iloc[:, i])
    dfs.append(df)
Then concat all the masks and combine them per cell, so a cell counts as matched if it matched any column of df2:
mask = pd.concat(dfs).groupby(level=0).any()
print (mask)
a b c d
0 True True True False
1 True True True False
2 True True True False
Apply boolean indexing:
print (df1.loc[:, mask.all()])
a b c
0 0 1 2
1 0 1 2
2 0 1 2
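For reference, a self-contained version of the same steps on the sample data, so it can be run directly (a sketch mirroring the approach above):
import pandas as pd

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1],
                    'c': [2, 2, 2], 'd': [3, 3, 3]})
df2 = pd.DataFrame({'i': [0, 0, 0], 'j': [1, 1, 1], 'k': [2, 2, 2]})

# one boolean mask per df2 column; isin with a Series aligns by index
dfs = [df1.isin(df2.iloc[:, i]) for i in range(len(df2.columns))]

# a cell counts as matched if any of the per-column masks is True
mask = pd.concat(dfs).groupby(level=0).any()

# keep only the df1 columns that matched in every row
print(df1.loc[:, mask.all()])
#    a  b  c
# 0  0  1  2
# 1  0  1  2
# 2  0  1  2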
Doing a column-wise comparison would give the desired result:
df1 = df1[(df1.a == df2.i) & (df1.b == df2.j) & (df1.c == df2.k)][['a','b','c']]
This keeps only those rows of df1 where the values of the first three columns are identical to those of df2, and then selects the columns 'a', 'b', 'c' from df1.
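If more column pairs were involved, the same row-wise test can be written once over the underlying arrays instead of spelling out each comparison. A sketch, assuming the two frames have the same number of rows and the column pairs line up positionally (the names pairs_left and pairs_right are just illustrative):
# compare a<->i, b<->j, c<->k in one shot
pairs_left = ['a', 'b', 'c']
pairs_right = ['i', 'j', 'k']
same = (df1[pairs_left].values == df2[pairs_right].values).all(axis=1)
df1 = df1.loc[same, pairs_left]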
I have a pandas dataframe (represented here using Excel):
Now I would like to delete all duplicates in a specific column (B).
How can I do it?
For this example, the result would look like this:
You can use duplicated to build a boolean mask and then set NaNs with loc, mask, or numpy.where:
df.loc[df['B'].duplicated(), 'B'] = np.nan
df['B'] = df['B'].mask(df['B'].duplicated())
df['B'] = np.where(df['B'].duplicated(), np.nan,df['B'])
Alternatively, if you need to remove the duplicate rows entirely by column B:
df = df.drop_duplicates(subset=['B'])
Sample:
df = pd.DataFrame({
    'B': [1, 2, 1, 3],
    'A': [1, 5, 7, 9]
})
print (df)
A B
0 1 1
1 5 2
2 7 1
3 9 3
df.loc[df['B'].duplicated(), 'B'] = np.nan
print (df)
A B
0 1 1.0
1 5 2.0
2 7 NaN
3 9 3.0
And the alternative, applied to the original df:
df = df.drop_duplicates(subset=['B'])
print (df)
A B
0 1 1
1 5 2
3 9 3
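duplicated also takes a keep parameter if the first occurrence should not survive either; a quick sketch on the same sample (keep=False marks every duplicated value, including the first one):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 9], 'B': [1, 2, 1, 3]})
# mark all occurrences of duplicated values, not just the later ones
df.loc[df['B'].duplicated(keep=False), 'B'] = np.nan
print(df)
#    A    B
# 0  1  NaN
# 1  5  2.0
# 2  7  NaN
# 3  9  3.0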
I have two dataframes:
df1:
   A  B  C  D  E
0  1  2  3  4  5
1  1  3  4  5  0
2  3  1  2  3  5
3  2  3  1  2  6
4  2  5  1  2  3
df2:
   K  L  M  N
0  1  3  4  2
1  1  2  5  3
2  3  2  3  1
3  1  4  5  0
4  2  2  3  6
5  2  1  2  7
What I need to do is match column A of df1 with column K of df2, column C of df1 with column L of df2, and column D of df1 with column M of df2. If all three values match, the corresponding value of N in df2 should be assigned to a new column F in df1. The output should be:
   A  B  C  D  E  F
0  1  2  3  4  5  2
1  1  3  4  5  0  0
2  3  1  2  3  5  1
3  2  3  1  2  6  7
4  2  5  1  2  3  7
Use DataFrame.merge with a left join, renaming df2's columns so they match df1's:
df = df1.merge(df2.rename(columns={'K':'A','L':'C','M':'D', 'N':'F'}), how='left')
print (df)
A B C D E F
0 1 2 3 4 5 2
1 1 3 4 5 0 0
2 3 1 2 3 5 1
3 2 3 1 2 6 7
4 2 5 1 2 3 7
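The same merge, written out end to end with the sample data so it can be run directly:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 3, 2, 2],
                    'B': [2, 3, 1, 3, 5],
                    'C': [3, 4, 2, 1, 1],
                    'D': [4, 5, 3, 2, 2],
                    'E': [5, 0, 5, 6, 3]})
df2 = pd.DataFrame({'K': [1, 1, 3, 1, 2, 2],
                    'L': [3, 2, 2, 4, 2, 1],
                    'M': [4, 5, 3, 5, 3, 2],
                    'N': [2, 3, 1, 0, 6, 7]})

# after the rename, the frames share columns A, C and D,
# so merge joins on exactly those keys
df = df1.merge(df2.rename(columns={'K': 'A', 'L': 'C', 'M': 'D', 'N': 'F'}),
               how='left')
print(df)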
Alternatively, join the frames side by side and compare row by row with a loop:
df3 = df1.join(df2)
F = []
for _, row in df3.iterrows():
    # take N when all three key columns agree, otherwise 0
    if row['A'] == row['K'] and row['C'] == row['L'] and row['D'] == row['M']:
        F.append(row['N'])
    else:
        F.append(0)
df1['F'] = F
df1
Note that this only finds matches when the matching rows happen to sit at the same position in both frames, so the merge above is the more general solution.
I have two dataframes that look like this:
df1: condition
A
A
A
B
B
B
B
df2: condition value
A 1
B 2
I would like to assign to each condition its value, adding a column to df1 in order to obtain:
df1: condition value
A 1
A 1
A 1
B 2
B 2
B 2
B 2
How can I do this? Thank you in advance!
Use map with a Series created by set_index if you only need to append one column:
df1['value'] = df1['condition'].map(df2.set_index('condition')['value'])
print (df1)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
Or use merge with a left join if df2 has more columns:
df = df1.merge(df2, on='condition', how='left')
print (df)
condition value
0 A 1
1 A 1
2 A 1
3 B 2
4 B 2
5 B 2
6 B 2
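If some condition in df1 had no counterpart in df2, both approaches would leave NaN in value; a one-line sketch if a default is wanted instead (the default 0 here is just an assumption):
# unmatched conditions get 0 instead of NaN
df1['value'] = df1['condition'].map(df2.set_index('condition')['value']).fillna(0)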
I have a dataframe:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
which looks like-
A B class
0 0 4 0
1 4 1 1
2 8 0 1
3 1 0 0
4 0 3 1
5 0 1 0
I want to calculate, for each column, the corresponding a, b, c, d, where:
a is the number of non-zero values in the column where class is 1,
b is the number of non-zero values in the column where class is 0,
c is the number of zeros in the column where class is 1,
d is the number of zeros in the column where class is 0.
For example, for column A the a, b, c, d are 2, 1, 1, 2.
Explanation: in column A, where column class is 1, the number of non-zero values in A is 2, therefore a=2 (indices 1, 2). Similarly b=1 (index 3).
My attempt (from when the dataframe had an equal number of rows in class 0 and class 1):
dataset = pd.read_csv('aaf.csv')
n = len(dataset.columns)  # no of columns
X = dataset.iloc[:, 1:n].values
l = len(X)  # no of rows
score = []
for i in range(n-1):
    #print(i)
    X_column = X[:, i]
    neg_array, pos_array = np.hsplit(X_column, 2)  ## hardcoded
    #print(pos_array.size)
    a = np.count_nonzero(pos_array)
    b = np.count_nonzero(neg_array)
    c = l/2 - a
    d = l/2 - b
Use:
d = {'class': [0, 1,1,0,1,0], 'A': [0,4,8,1,0,0],'B':[4,1,0,0,3,1]}
df = pd.DataFrame(data=d)
df = (df.set_index('class')
        .ne(0)
        .stack()
        .groupby(level=[0, 1])
        .value_counts()
        .unstack(1)
        .sort_index(level=1, ascending=False)
        .T)
print (df)
class 1 0 1 0
True True False False
A 2 1 1 2
B 2 2 1 1
df.columns = list('abcd')
print (df)
a b c d
A 2 1 1 2
B 2 2 1 1
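The same table can also be built more explicitly with boolean masks, which may be easier to follow than the stack/unstack chain (a sketch on the same sample data):
import pandas as pd

d = {'class': [0, 1, 1, 0, 1, 0], 'A': [0, 4, 8, 1, 0, 0], 'B': [4, 1, 0, 0, 3, 1]}
df = pd.DataFrame(data=d)

nz = df.drop('class', axis=1).ne(0)   # True where a value is non-zero
cls = df['class']

out = pd.DataFrame({
    'a': nz[cls.eq(1)].sum(),         # non-zero values in class 1
    'b': nz[cls.eq(0)].sum(),         # non-zero values in class 0
    'c': (~nz)[cls.eq(1)].sum(),      # zeros in class 1
    'd': (~nz)[cls.eq(0)].sum(),      # zeros in class 0
})
print(out)
#    a  b  c  d
# A  2  1  1  2
# B  2  2  1  1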
>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates.
The word "coordinates" reminds me of pivot, since
if you have two columns whose values represent "coordinates" and a third
column representing values, and you want to convert that to a grid, then
pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
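As a quick sanity check on the "melt is the inverse of pivot" reasoning, pivoting df3 back should reproduce df1's grid (a small sketch reusing the frames above):
# rows come from 'index', columns from 'variable', cells from 'value'
grid = df3.pivot(index='index', columns='variable', values='value')
print(grid)
# variable  0  1  2
# index
# 0         A  B  C
# 1         D  E  F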