Pandas merge on multiple columns ignoring NaN - python

I am trying to do the same as this answer, but with the difference that I'd like to ignore NaN in some cases. For instance:
#df1
c1 c2 c3
0 a b 1
1 a c 2
2 a nan 1
3 b nan 3
4 c d 1
5 d e 3
#df2
c1 c2 c4
0 a nan 1
1 a c 2
2 a x 1
3 b nan 3
4 z y 2
#merged output based on [c1, c2], dropping instances
#with `NaN` unless both dataframes have `NaN`.
c1 c2 c3 c4
0 a b 1 1 #c1,c2 from df1 because df2 has a nan in c2
1 a c 2 2 #in both
2 a x 1 1 #c1,c2 from df2 because df1 has a nan in c2
3 b nan 3 3 #c1,c2 as found in both
4 c d 1 nan #from df1
5 d e 3 nan #from df1
6 z y nan 2 #from df2
NaNs may come from either c1 or c2, but for this example I kept it simpler.
I'm not sure what the cleanest way to do this is. I was thinking of merging based on [c1,c2] and then looping over the rows with NaN, but that would not be very direct. Do you see a better way to do it?
Edit - clarifying conditions
1. No duplicates are found anywhere.
2. Two rows are never combined if they both already have values. c1 may not be combined with c2, so column order must be respected.
3. For the cases where one of the two dataframes has a NaN in either c1 or c2, find the rows in the other dataframe that don't have a full match on both c1+c2, and use those. For instance:
(a,c) has a match in both so it is no longer discussed.
(a,b) is only in df1. No b is found in df2.c2. The only row in df2 with a known key and a NaN is row 0, so it is combined with this one. Note that order must be respected; this is why (a,b) from df1 cannot be combined with any other row of df2 that also contains a b.
(a,x) is only in df2. No x is found in df1.c2. The only row in df1 with one of the known keys and a NaN is the row with index 2.
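Not from the original thread, but here is one possible sketch of the two-stage logic described above: take the exact matches on both keys first, then let a NaN act as a wildcard for a single leftover row of the other dataframe within the same c1 group. It assumes the stated conditions (no duplicates, and NaN appearing only in c2, as in this example):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'c1': list('aaabcd'),
                    'c2': ['b', 'c', np.nan, np.nan, 'd', 'e'],
                    'c3': [1, 2, 1, 3, 1, 3]})
df2 = pd.DataFrame({'c1': list('aaabz'),
                    'c2': [np.nan, 'c', 'x', np.nan, 'y'],
                    'c4': [1, 2, 1, 3, 2]})

# Stage 1: exact matches on both keys (rows with a NaN key are set aside).
exact = df1.dropna(subset=['c2']).merge(
    df2.dropna(subset=['c2']), on=['c1', 'c2'], how='inner')

def unmatched(df):
    # rows whose (c1, c2) pair did not get an exact match
    keys = set(zip(exact['c1'], exact['c2']))
    return df[[pair not in keys for pair in zip(df['c1'], df['c2'])]]

rest1, rest2 = unmatched(df1), unmatched(df2)

# Stage 2: within each c1 group, a NaN c2 acts as a wildcard for one
# leftover row of the other frame; remaining wildcards pair with each
# other (the both-NaN case); anything else passes through with NaN fill.
rows = []
for key in sorted(set(rest1['c1']) | set(rest2['c1'])):
    g1 = rest1[rest1['c1'] == key].to_dict('records')
    g2 = rest2[rest2['c1'] == key].to_dict('records')
    wild1 = [r for r in g1 if pd.isna(r['c2'])]
    wild2 = [r for r in g2 if pd.isna(r['c2'])]
    for r in [r for r in g1 if pd.notna(r['c2'])]:
        rows.append({**(wild2.pop(0) if wild2 else {}), **r})
    for r in [r for r in g2 if pd.notna(r['c2'])]:
        rows.append({**(wild1.pop(0) if wild1 else {}), **r})
    rows += [{**r2, **r1} for r1, r2 in zip(wild1, wild2)]

out = pd.concat([exact, pd.DataFrame(rows)], ignore_index=True)
print(out)
This reproduces the merged output above (up to row order); it is a sketch of the matching rules, not a vectorised solution.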

How to shift a dataframe element-wise to fill NaNs?

I have a DataFrame like this:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
I am trying to fill each NaN with the value of the previous column in the next row, then drop that second row. In other words, I want to combine two rows with NaNs to form a single row without NaNs, like this:
a b
0 A E
1 B C
2 D F
I have tried various flavors of df.fillna(method="<bfill/ffill>") but this didn't give me the expected output.
I haven't found any other question about this problem (edit: here's one). The DataFrame is actually made from a list of DataFrames with .concat(), as you may notice from the indexes; I mention this because it may be easier to do in a single row rather than across multiple rows.
I have found some suggestions to use shift and combine_first, but none of them worked for me. You may try those too.
I also found this, a whole article about filling NaN values, but it doesn't cover a problem/answer like mine.
OK, I misunderstood what you wanted to do the first time. The dummy example was a bit ambiguous.
Here is another:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
To my knowledge, this operation does not exist with pandas, so we will use numpy to do the work.
First transform the dataframe to a numpy array and flatten it to one dimension. Then drop NaNs using pandas.isna, which works on a wider range of types than numpy.isnan, and reshape the array to its original shape before transforming it back to a dataframe:
array = df.to_numpy().flatten()
pd.DataFrame(array[~pd.isna(array)].reshape(-1,df.shape[1]), columns=df.columns)
output:
a b
0 A E
1 B C
2 D F
It also works for more complex examples, as long as the NaN pattern is the same across the columns that contain NaNs:
In:
a b c d
0 A H A2 H2
1 B NaN B2 NaN
2 C NaN C2 NaN
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
Out:
a b c d
0 A H A2 H2
1 B B2 C C2
2 D I D2 I2
3 E E2 F F2
4 G J G2 J2
In:
a b c
0 A F H
1 B NaN NaN
2 C NaN NaN
3 D NaN NaN
4 E G I
Out:
a b c
0 A F H
1 B C D
2 E G I
In case the NaN columns do not share the same pattern, such as:
a b c d
0 A H A2 NaN
1 B NaN B2 NaN
2 C NaN C2 H2
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
You can apply the operation per group of two columns:
def elementwise_shift(df):
    array = df.to_numpy().flatten()
    return pd.DataFrame(array[~pd.isna(array)].reshape(-1, df.shape[1]), columns=df.columns)

(df.groupby(np.repeat(np.arange(df.shape[1] // 2), 2), axis=1)
   .apply(elementwise_shift))
output:
a b c d
0 A H A2 B2
1 B C C2 H2
2 D I D2 I2
3 E F E2 F2
4 G J G2 J2
You can do this in two steps with a placeholder column. First you fill all the NaNs in column b with the a values from the next row. Then you apply the filtering. In this example I use ffill with a limit of 1 to filter out all NaN values after the first; there's probably a better method.
import pandas as pd
import numpy as np
df=pd.DataFrame({"a":[1,2,3,3,4],"b":[1,2,np.nan,np.nan,4]})
# Fill all nans:
df['new_b'] = df['b'].fillna(df['a'].shift(-1))
df = df[df['b'].ffill(limit=1).notna()].copy()  # .copy() avoids SettingWithCopyWarning on later assignment
df = df.drop('b', axis=1).rename(columns={'new_b': 'b'})
print(df)
# output:
# a b
# 0 1 1
# 1 2 2
# 2 3 3
# 4 4 4

Need to check if a data frame is subset of another data frame [duplicate]

This question already has answers here:
Check if pandas dataframe is subset of other dataframe
(3 answers)
Closed 3 years ago.
I have 2 csv files (csv1, csv2). csv2 may have new columns or rows added.
I need to verify whether csv1 is a subset of csv2. To be a subset, each whole row of csv1 should be present in csv2; elements from any new column or row should be ignored.
csv1:
c1,c2,c3
A,A,6
D,A,A
A,1,A
csv2:
c1,c2,c3,c4
A,A,6,L
A,changed,A,L
D,A,A,L
Z,1,A,L
Added,Anew,line,L
What I am trying is:
df1 = pd.read_csv(csv1_file)
df2 = pd.read_csv(csv2_file)
matching_cols=df1.columns.intersection(df2.columns).tolist()
sorted_df1 = df1.sort_values(by=list(matching_cols)).reset_index(drop=True)
sorted_df2 = df2.sort_values(by=list(matching_cols)).reset_index(drop=True)
print("truth data>>>\n",sorted_df1)
print("Test data>>>\n",sorted_df2)
df1_mask = sorted_df1[matching_cols].eq(sorted_df2[matching_cols])
# print(df1_mask)
print("compared data>>>\n",sorted_df1[df1_mask])
It gives the output as:
truth data>>>
c1 c2 c3
0 A 1 A
1 A A 6
2 D A A
Test data>>>
c1 c2 c3 c4
0 A A 6 L
1 A changed A L
2 Added Anew line L
3 D A A L
4 Z 1 A L
compared data>>>
c1 c2 c3
0 A NaN NaN
1 A NaN NaN
2 NaN NaN NaN
What I want is:
compared data>>>
c1 c2 c3
0 NaN NaN NaN
1 A A 6
2 D A A
Please help.
Thanks
If you need missing values where there is no match, use DataFrame.merge with a left join and the indicator parameter, then set missing values by mask and remove the helper column _merge:
import numpy as np

matching_cols = df1.columns.intersection(df2.columns)
df2 = df1[matching_cols].merge(df2[matching_cols], how='left', indicator=True)
df2.loc[df2['_merge'].ne('both')] = np.nan  # blank out rows with no match
df2 = df2.drop('_merge', axis=1)
print(df2)
c1 c2 c3
0 A A 6
1 D A A
2 NaN NaN NaN
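As a side note, if you only need a True/False answer to "is df1 a subset of df2?", the same merge-with-indicator idea reduces to a short check (a sketch along the lines of the linked duplicate):
matching_cols = df1.columns.intersection(df2.columns)
merged = df1[matching_cols].merge(df2[matching_cols], how='left', indicator=True)
print(merged['_merge'].eq('both').all())  # False here: (A, 1, A) has no match in df2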

How to repeat pandas dataframe records based on column value

I'm trying to duplicate rows of a pandas DataFrame (v.0.23.4, python v.3.7.1) based on an int value in one of the columns. I'm applying code from this question to do that, but I'm running into the following data type casting error: TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'. Basically, I'm not understanding why this code is attempting to cast to int32.
Starting with this,
dummy_dict = {'c1': ['a', 'b', 'c'],
              'c2': [0, 1, 2],
              'c3': ['textA', 'textB', 'textC']}
dummy_df = pd.DataFrame(dummy_dict)
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
I'm doing this
dummy_df_test = dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2']))
I want this at the end. However, I'm getting the above error instead.
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
3 c 2 textC
Just a workaround:
pd.concat([dummy_df[dummy_df.c2.eq(0)],dummy_df.loc[dummy_df.index.repeat(dummy_df.c2)]])
Another fantastic suggestion, courtesy of @Wen:
dummy_df.reindex(dummy_df.index.repeat(dummy_df['c2'].clip(lower=1)))
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
2 c 2 textC
I believe the answer as to why it's happening can be found here:
https://github.com/numpy/numpy/issues/4384
Specifying the dtype as int32 should solve the problem as highlighted in the original comment.
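For example, a sketch of that workaround (casting the repeat counts is the assumption here):
# int32 repeat counts avoid the 'safe' casting error on platforms
# whose default integer is 32-bit (e.g. Windows)
dummy_df_test = dummy_df.reindex(
    dummy_df.index.repeat(dummy_df['c2'].astype('int32')))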
In the first attempt all rows are duplicated, and in the second just the row with index 2, thanks to the concat function.
print(df)
df2 = pd.concat([df] * 2, ignore_index=True)
print(df2)
df3 = pd.concat([df, df.iloc[[2]]])
print(df3)
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
3 a 0 textA
4 b 1 textB
5 c 2 textC
c1 c2 c3
0 a 0 textA
1 b 1 textB
2 c 2 textC
2 c 2 textC
If you plan to reset the index at the end:
df3 = df3.reset_index(drop=True)

Determining when order of a set of columns changes in pandas dataframe

I have a very large csv file with following structure:
a1 b1 c1 a2 b2 c2 a3 b3 c3 ..... a999 b999 c999
0 5 4 2 3 2 2 6 7 9 ....................
1 2 1 4 4 6 9 3 5 9 ....................
.
.
What I want to do is to group the columns in sets of N, for a, b and c, and check when the index of maximum value (argmax) of the set changes, in each row.
So in the above example, for N = 3, (a1, b1, c1) is the first set in row 0 and its argmax is 0; the 2nd set is (a2, b2, c2) and the argmax is still 0; the 3rd set is (a3, b3, c3), but now the argmax is 2. Ideally I am looking for a script that parses the whole csv file and returns [c3, c1]: c3 because that's where the argmax changes in row 0, and c1 because the argmax doesn't change in row 1 but c1 is the largest value in that set.
I am doing this right now by using two for loops, and it's slow and looks very ugly. Is there a better, pandas-pythonic way of doing this? I feel there must be.
I tried to keep the code as simple as possible. You can transpose your dataframe and group by the sliced column name:
dft = df.T.reset_index()
idx = dft.groupby(dft['index'].str.slice(1, 2)).idxmax()
Output:
0 1
index
1 0 2
2 3 5
3 8 8
That means that for row 0 the max for group 1 is at index 0, the max for group 2 is at index 3 (or 0 if you take it mod 3), and the max for group 3 is at index 8 (or 2 if you take it mod 3). Same reading for row 1 :)
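If you want the position within each set directly, you can take the result modulo 3 (a small sketch reusing idx from above):
print(idx % 3)  # 0, 1, 2 correspond to the a, b, c column of each set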
If you need the actual column name:
df.columns[idx.values.flatten(order='F')]
Output:
['a1', 'a2', 'c3', 'c1', 'c2', 'c3']
You can groupby sets of columns and use .idxmax to find the column where the maximum occurs within each set. You can find where the first letter changes (if it ever does) to get your list.
n = 3
df2 = df.groupby([x // n for x in range(len(df.columns))], axis=1).idxmax(1)
mask = df2.applymap(lambda x: x[0])  # case of a 1-letter column prefix
## If words of different lengths ending in digits are possible, try:
# import string
# mask = df2.applymap(lambda x: x.strip(string.digits))
df2.lookup(df2.index,
           (mask.ne(mask.shift(-1, axis=1)).idxmax(1) + 1) % len(mask.columns)).tolist()
Sample Data
print(df)
a1 b1 c1 a2 b2 c2 a3 b3 c3
0 5 4 2 3 2 2 6 7 9
1 2 1 4 4 6 9 3 5 9
2 2 1 4 10 6 9 3 5 9
3 2 1 4 1 6 9 3 10 9
n = 3
df2 = df.groupby([x // n for x in range(len(df.columns))], axis=1).idxmax(1)
print(df2)
# 0 1 2
#0 a1 a2 c3
#1 c1 c2 c3
#2 c1 a2 c3
#3 c1 c2 b3
mask = df2.applymap(lambda x: x[0])
df2.lookup(df2.index, (mask.ne(mask.shift(-1, axis=1)).idxmax(1) + 1) % len(mask.columns)).tolist()
#['c3', 'c1', 'a2', 'b3']
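Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0; an equivalent with plain NumPy indexing would be (a sketch, reusing df2 and mask from above):
import numpy as np

cols = (mask.ne(mask.shift(-1, axis=1)).idxmax(1) + 1) % len(mask.columns)
print(df2.to_numpy()[np.arange(len(df2)), df2.columns.get_indexer(cols)].tolist())
# ['c3', 'c1', 'a2', 'b3']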

transpose dataframe changing columns names

Could you please help me transform the dataframe df
df=pd.DataFrame(data=[['a1',2,3],['b1',5,6],['c1',8,9]],columns=['A','B','C'])
df
Out[37]:
A B C
0 a1 2 3
1 b1 5 6
2 c1 8 9
in df2
df2=pd.DataFrame(data=[[2,5,8],[3,6,9]],columns=['a1','b1','c1'])
df2
Out[36]:
a1 b1 c1
0 2 5 8
1 3 6 9
The first column should become the column names,
and then I should transpose the elements... is there a pythonic way?
A little trick with slicing: initialise a new DataFrame.
pd.DataFrame(df.values.T[1:], columns=df.A.tolist())
Or,
pd.DataFrame(df.values[:, 1:].T, columns=df.A.tolist())
a1 b1 c1
0 2 5 8
1 3 6 9
For a general solution, use set_index with a transpose:
df1 = df.set_index('A').T.reset_index(drop=True).rename_axis(None, axis=1)
Or remove column A, transpose, and build a new DataFrame with the constructor:
df1 = pd.DataFrame(df.drop('A', axis=1).T.values, columns=df['A'].values)
print (df1)
a1 b1 c1
0 2 5 8
1 3 6 9
