I have a dataframe:
id value
a1 1,2
b2 4
c1 NaN
c5 9,10,11
I want to create a new column mean_value that is equal to the mean of the values in column value:
id value mean_value
a1 1,2 1.5
b2 4 4
c5 9,10,11 10
I also want to remove the rows that have NaN in it. How can I do that?
Here's one way using str.split and mean:
# Split on commas, cast to float, take the row-wise mean, then drop the NaN row
df = df.assign(mean_value=df['value'].str.split(',', expand=True).astype(float)
                                     .mean(axis=1)).dropna()
Output:
id value mean_value
0 a1 1,2 1.5
1 b2 4 4.0
3 c5 9,10,11 10.0
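For reference, a self-contained run that reproduces the output above (the frame is rebuilt from the post, so treat it as a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': ['a1', 'b2', 'c1', 'c5'],
                   'value': ['1,2', '4', np.nan, '9,10,11']})

# Split on commas into separate columns, cast to float, average per row,
# then drop the row whose value (and therefore mean) is NaN.
df = df.assign(mean_value=df['value'].str.split(',', expand=True)
                                     .astype(float)
                                     .mean(axis=1)).dropna()
print(df)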
df1=
A B C D
a1 b1 c1 1
a2 b2 c2 2
a3 b3 c3 4
df2=
A B C D
a1 b1 c1 2
a2 b2 c2 1
I want to compare the values of column 'D' in both dataframes. If both dataframes had the same number of rows, I would just do this:
newDF = df1['D']-df2['D']
However, there are times when the number of rows is different. I want a result DataFrame like this:
resultDF=
A B C D_df1 D_df2 Diff
a1 b1 c1 1 2 -1
a2 b2 c2 2 1 1
EDIT: compare the first row of column D only if the first row of A, B, C is the same in df1 and df2; repeat similarly for every row.
Use merge and DataFrame.eval:
# Inner join on the key columns, then compute the difference in a single expression
df1.merge(df2, on=['A','B','C'], suffixes=['_df1','_df2']).eval('Diff=D_df1 - D_df2')
Out[314]:
A B C D_df1 D_df2 Diff
0 a1 b1 c1 1 2 -1
1 a2 b2 c2 2 1 1
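If you also want to keep rows that appear in only one of the frames, an outer join is one possible extension (not part of the question; unmatched rows simply get NaN in the missing D column and in Diff):
result = (df1.merge(df2, on=['A', 'B', 'C'], how='outer',
                    suffixes=['_df1', '_df2'])
             .eval('Diff = D_df1 - D_df2'))
print(result)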
I am trying to do the same as this answer, but with the difference that I'd like to ignore NaN in some cases. For instance:
#df1
c1 c2 c3
0 a b 1
1 a c 2
2 a nan 1
3 b nan 3
4 c d 1
5 d e 3
#df2
c1 c2 c4
0 a nan 1
1 a c 2
2 a x 1
3 b nan 3
4 z y 2
#merged output based on [c1, c2], dropping instances
#with `NaN` unless both dataframes have `NaN`.
c1 c2 c3 c4
0 a b 1 1 #c1,c2 from df1 because df2 has a nan in c2
1 a c 2 2 #in both
2 a x 1 1 #c1,c2 from df2 because df1 has a nan in c2
3 b nan 3 3 #c1,c2 as found in both
4 c d 1 nan #from df1
5 d e 3 nan #from df1
6 z y nan 2 #from df2
NaNs may come from either c1 or c2, but for this example I kept it simpler.
I'm not sure what the cleanest way to do this is. I was thinking of merging on [c1, c2] and then looping over the rows with NaN, but that won't be very direct. Do you see a better way to do it?
Edit - clarifying conditions
1. No duplicates are found anywhere.
2. No combination is performed between two rows if they both have values. c1 may not be combined with c2, so order must be respected.
3. For the cases where one of the two dfs has a NaN in either c1 or c2, find the rows in the other dataframe that don't have a full match on both c1+c2, and use those. For instance:
(a,c) has a match in both so it is no longer discussed.
(a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a NaN is row 0, so it is combined with this one. Note that order must be respected; this is why (a,b) from df1 cannot be combined with any other row of df2 that also contains a b.
(a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys and a NaN is the row with index 2.
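One possible starting point, not a full solution: an outer merge on [c1, c2] with indicator=True shows which rows match exactly and which remain one-sided; the NaN-fallback pairing described in the conditions above would then still have to be applied to the left_only/right_only rows:
merged = df1.merge(df2, on=['c1', 'c2'], how='outer', indicator=True)
# 'both' rows matched on c1+c2 (pandas also matches NaN keys to NaN keys,
# so the shared (b, NaN) row pairs up); the one-sided rows are the
# candidates for the fallback combination.
print(merged)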
I have 2 CSV files (csv1, csv2). In csv2 there might be a new column or row added.
I need to verify whether csv1 is a subset of csv2. To count as a subset, each whole row of csv1 should be present in csv2, and elements from any new column or row should be ignored.
csv1:
c1,c2,c3
A,A,6
D,A,A
A,1,A
csv2:
c1,c2,c3,c4
A,A,6,L
A,changed,A,L
D,A,A,L
Z,1,A,L
Added,Anew,line,L
What I am trying is:
df1 = pd.read_csv(csv1_file)
df2 = pd.read_csv(csv2_file)
matching_cols=df1.columns.intersection(df2.columns).tolist()
sorted_df1 = df1.sort_values(by=list(matching_cols)).reset_index(drop=True)
sorted_df2 = df2.sort_values(by=list(matching_cols)).reset_index(drop=True)
print("truth data>>>\n",sorted_df1)
print("Test data>>>\n",sorted_df2)
df1_mask = sorted_df1[matching_cols].eq(sorted_df2[matching_cols])
# print(df1_mask)
print("compared data>>>\n",sorted_df1[df1_mask])
It gives the output as:
truth data>>>
c1 c2 c3
0 A 1 A
1 A A 6
2 D A A
Test data>>>
c1 c2 c3 c4
0 A A 6 L
1 A changed A L
2 Added Anew line L
3 D A A L
4 Z 1 A L
compared data>>>
c1 c2 c3
0 A NaN NaN
1 A NaN NaN
2 NaN NaN NaN
What I want is:
compared data>>>
c1 c2 c3
0 NaN NaN NaN
1 A A 6
2 D A A
Please help.
Thanks
If you need missing values in the last row because there is no match, use DataFrame.merge with a left join and the indicator parameter, then set the missing values by a mask and remove the helper column _merge:
import numpy as np

matching_cols = df1.columns.intersection(df2.columns)
# Left join keeps every row of df1; the _merge indicator shows whether df2 has a matching row
df2 = df1[matching_cols].merge(df2[matching_cols], how='left', indicator=True)
# Rows of df1 without a match in df2 become all-NaN
df2.loc[df2['_merge'].ne('both')] = np.nan
df2 = df2.drop('_merge', axis=1)
print(df2)
c1 c2 c3
0 A A 6
1 D A A
2 NaN NaN NaN
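If all you need is a True/False check that csv1 is a subset of csv2 on the shared columns, the same merge can be reduced to a boolean. A small sketch building on the answer above (note it uses the original df1 and df2 as read from the CSV files, before df2 is overwritten):
is_subset = (df1[matching_cols].merge(df2[matching_cols],
                                      how='left', indicator=True)
                               ['_merge'].eq('both').all())
print(is_subset)   # False for the sample data, because row A,1,A has no match in csv2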
I want to group by columns where the commutative rule applies.
For example, if columns 1 and 2 contain the values (a,b) in one row and (b,a) in another row, then I want to group these two records together and perform a group-by operation.
Input:
From To Count
a1 b1 4
b1 a1 3
a1 b2 2
b3 a1 12
a1 b3 6
Output:
From To Count(+)
a1 b1 7
a1 b2 2
b3 a1 18
I tried to apply group by after swapping the elements, but I don't have an approach to solve this problem. Help me solve it.
Thanks in advance.
Use numpy.sort for sorting each row:
import numpy as np

cols = ['From','To']
# Sort the two key columns in each row so (a1, b1) and (b1, a1) become the same pair
df[cols] = pd.DataFrame(np.sort(df[cols], axis=1), index=df.index, columns=cols)
print(df)
From To Count
0 a1 b1 4
1 a1 b1 3
2 a1 b2 2
3 a1 b3 12
4 a1 b3 6
# Sum Count for each normalized (From, To) pair
df1 = df.groupby(cols, as_index=False)['Count'].sum()
print(df1)
From To Count
0 a1 b1 7
1 a1 b2 2
2 a1 b3 18
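For completeness, a self-contained run on the sample data (the frame is rebuilt from the post):
import numpy as np
import pandas as pd

df = pd.DataFrame({'From':  ['a1', 'b1', 'a1', 'b3', 'a1'],
                   'To':    ['b1', 'a1', 'b2', 'a1', 'b3'],
                   'Count': [4, 3, 2, 12, 6]})

cols = ['From', 'To']
# Sort each (From, To) pair so the order of the two keys no longer matters
df[cols] = np.sort(df[cols], axis=1)
print(df.groupby(cols, as_index=False)['Count'].sum())  # matches the grouped output above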
Now I would like to handle this dataframe:
df
A B
1 A0
1 A1
1 B0
2 B1
2 B2
3 B3
3 A2
3 A3
First, I would like to group by df.A
sub1
A B
1 A0
1 A1
1 B0
Second, I would like to extract the first row that contains the letter A:
A B
1 A0
If there is no A
sub2
A B
2 B1
2 B2
I would like to extract the first row:
A B
2 B1
So, I would like to get the result below
A B
1 A0
2 B1
3 A2
I would like to handle this priority-based extraction; I tried grouping but couldn't figure it out. How can I handle this?
You can group by column A and, for each group, use idxmax() on str.contains("A"). If there is an A in column B, it returns the first index that contains the letter A; otherwise it falls back to the first row of the group, since all the values are False:
df.groupby("A", as_index=False).apply(lambda g: g.loc[g.B.str.contains("A").idxmax()])
# A B
#0 1 A0
#1 2 B1
#2 3 A2
In cases where you may have a duplicated index, you can use numpy.ndarray.argmax() with iloc, which accepts an integer as position indexing:
df.groupby("A", as_index=False).apply(lambda g: g.iloc[g.B.str.contains("A").values.argmax()])
# A B
#0 1 A0
#1 2 B1
#2 3 A2
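For reference, a self-contained run (frame rebuilt from the post). idxmax() returns the label of the first True value; when a group has no A at all, every value is False, so it falls back to the group's first label, which is exactly the desired fallback:
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 3, 3, 3],
                   'B': ['A0', 'A1', 'B0', 'B1', 'B2', 'B3', 'A2', 'A3']})

out = df.groupby('A', as_index=False).apply(
    lambda g: g.loc[g.B.str.contains('A').idxmax()])
print(out)  # matches the output shown above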