Apply function for two dataframes in pandas - python

I have two dataframes:
df0
a b
c 0.3 0.6
d 0.4 NaN
df1
a b
c 3 2
d 0 4
I have a custom function:
def concat(d0, d1):
    if d0 is not None and d1 is not None:
        return '%s,%s' % (d0, d1)
    return None
Result I expect:
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
How can I apply the function to these two dataframes?

Here is a solution.
The idea is to first flatten both dataframes into flat sequences of values. You can then loop over the paired values with zip and apply your function to each pair.
Finally, restore the original shape with numpy reshape:
new_vals = [concat(d0, d1) for d0, d1 in zip(df0.values.flat, df1.values.flat)]
result = pd.DataFrame(np.reshape(new_vals, df0.shape), index=df0.index, columns=df0.columns)
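One caveat with the function from the question: a missing cell is NaN rather than None, so the `is not None` test lets it through and the last cell comes out as 'nan,4'. Below is a minimal sketch of the same flatten, zip, reshape idea with a NaN-aware check. The helper name concat_values is made up for illustration, and the frames are rebuilt from the question's sample data.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df0 = pd.DataFrame({"a": [0.3, 0.4], "b": [0.6, np.nan]}, index=["c", "d"])
df1 = pd.DataFrame({"a": [3, 0], "b": [2, 4]}, index=["c", "d"])


def concat_values(d0, d1):
    """Join two scalars with a comma, or return NaN if either is missing."""
    if pd.isna(d0) or pd.isna(d1):
        return np.nan
    return f"{d0},{d1}"


# Walk both frames in row-major order, pairing cells at the same position
new_vals = [
    concat_values(d0, d1)
    for d0, d1 in zip(df0.to_numpy().flat, df1.to_numpy().flat)
]

# dtype=object keeps the strings and the NaN side by side
result = pd.DataFrame(
    np.array(new_vals, dtype=object).reshape(df0.shape),
    index=df0.index,
    columns=df0.columns,
)
print(result)
```

This assumes both frames have the same shape and the same row and column order, since the pairing is purely positional.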

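If this is your specific application, you can do:
# Concatenate the two as strings
df = df0.astype(str) + "," + df1.astype(str)
# Replace any cell that contains 'nan' with a real NaN
# (DataFrame.applymap is deprecated since pandas 2.1; DataFrame.map is its replacement)
df = df.applymap(lambda x: x if 'nan' not in x else np.nan)
This performs better than calling apply with a custom function.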
output
a b
c 0.3,3 0.6,2
d 0.4,0 NaN

Use add with applymap and mask:
df = df0.astype(str).add(',').add(df1.astype(str))
df = df.mask(df.applymap(lambda x: 'nan' in x))
print (df)
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
Another solution is to build the NaN mask from the original dataframes and apply it at the end with mask; by default, cells where the mask is True are replaced with NaN:
df = df0.astype(str).add(',').add(df1.astype(str))
m = df0.isnull() | df1.isnull()
print (m)
a b
c False False
d False True
df = df.mask(m)
print (df)
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
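For reference, here is a self-contained sketch of the same mask idea, written with where and the inverted condition (keep a cell only when both inputs are present). The sample frames are reconstructed from the question, and the name both_present is only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df0 = pd.DataFrame({"a": [0.3, 0.4], "b": [0.6, np.nan]}, index=["c", "d"])
df1 = pd.DataFrame({"a": [3, 0], "b": [2, 4]}, index=["c", "d"])

# True where neither input is missing
both_present = df0.notna() & df1.notna()

# Element-wise string concatenation; missing cells become the text 'nan' here
joined = df0.astype(str) + "," + df1.astype(str)

# where() keeps cells whose condition is True and blanks out the rest
result = joined.where(both_present)
print(result)
```

Building the condition from the numeric frames avoids searching the joined strings for the text 'nan'.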

Related

How to combine two columns of different dataframes such that they have unique values?

I have two different dataframes and I want to get the sorted
values of two columns.
Setup
import numpy as np
import pandas as pd
df1 = pd.DataFrame({
    'id': range(7),
    'c': list('EDBBCCC')
})
df2 = pd.DataFrame({
    'id': range(8),
    'c': list('EBBCCCAA')
})
Desired Output
# notice that ABCDE appear in alphabetical order
c_first c_second
NAN A
B B
C C
D NAN
E E
What I've tried
pd.concat([df1.c.sort_values().drop_duplicates().rename('c_first'),
           df2.c.sort_values().drop_duplicates().rename('c_second')],
          axis=1)
How can I get the output in the required format?
Here is one possible way to achieve it:
t1 = df1.c.drop_duplicates()
t2 = df2.c.drop_duplicates()
tmp1 = pd.DataFrame({'id':t1, 'c_first':t1})
tmp2 = pd.DataFrame({'id':t2, 'c_second':t2})
result = pd.merge(tmp1,tmp2, how='outer').sort_values('id').drop('id', axis=1)
result
c_first c_second
4 NaN A
0 B B
1 C C
2 D NaN
3 E E
https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.concat.html
The concat function has a sort argument.
Try adding sort=True.
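As a further sketch, assuming the goal is to line up equal letters in the same row, each column's unique values can be used as their own index so that concat aligns on the letters themselves. The helper name unique_keyed is hypothetical; the data is the setup from the question.

```python
import pandas as pd

df1 = pd.DataFrame({"id": range(7), "c": list("EDBBCCC")})
df2 = pd.DataFrame({"id": range(8), "c": list("EBBCCCAA")})


def unique_keyed(series, name):
    """Return the unique values of `series`, indexed by those same values."""
    values = series.drop_duplicates().to_numpy()
    return pd.Series(values, index=values, name=name)


first = unique_keyed(df1["c"], "c_first")
second = unique_keyed(df2["c"], "c_second")

# concat along axis=1 takes the union of the two indexes; letters that are
# missing on one side become NaN. Sorting the index gives alphabetical order.
result = (
    pd.concat([first, second], axis=1)
    .sort_index()
    .reset_index(drop=True)
)
print(result)
```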

How I can merge the columns into a single column in Python?

I want to merge 3 columns into a single column. I have tried changing the column types. However, I could not do it.
For example, I have 3 columns such as A: {1,2,4}, B:{3,4,4}, C:{1,1,1}
Output expected: ABC Column {131, 241, 441}
My inputs are like this:
df['ABC'] = df['A'].map(str) + df['B'].map(str) + df['C'].map(str)
df.head()
ABC {13.01.0 , 24.01.0, 44.01.0}
The dtype of ABC appears to be object, and I could not change it with str or int.
df['ABC'].apply(str)
Also, I realized that there are NaN values in the A, B, and C columns. Is it possible to merge these even with NaN values?
# Example
import pandas as pd
import numpy as np
df = pd.DataFrame()
# Considering NaN's in the data-frame
df['colA'] = [1, 2, 4, np.nan, 5]
df['colB'] = [3, 4, 4, 3, np.nan]
df['colC'] = [1,1,1,4,1]
# Using pd.isna() to check for NaN values in the columns
df['colA'] = df['colA'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colB'] = df['colB'].apply(lambda x: x if pd.isna(x) else str(int(x)))
df['colC'] = df['colC'].apply(lambda x: x if pd.isna(x) else str(int(x)))
# Filling the NaN values with an empty string
df = df.fillna('')
# Transform columns into string
df = df.astype(str)
# Concatenating all together
df['ABC'] = df.sum(axis=1)
A workaround for your NaN problem could look like this, but note that every NaN becomes 0:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 4, np.nan], 'B': [3, 4, 4, 4], 'C': [1, np.nan, 1, 3]})
# Replace NaN with 0, cast to int to drop the trailing '.0', then convert to str
df = df.fillna(0).astype(int).astype(str)
df['ABC'] = df['A'] + df['B'] + df['C']
output
A B C ABC
0 1 3 1 131
1 2 4 0 240
2 4 4 1 441
3 0 4 3 043
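If turning NaN into 0 is not acceptable, one option is to skip missing values while joining. The following is only a sketch under that assumption; the helper name to_text is made up, and the data matches the workaround above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 4, np.nan], "B": [3, 4, 4, 4], "C": [1, np.nan, 1, 3]})


def to_text(value):
    """Render a number without the trailing '.0'; render NaN as an empty string."""
    return "" if pd.isna(value) else str(int(value))


# Join the three columns row by row, leaving out missing values
df["ABC"] = df[["A", "B", "C"]].apply(
    lambda row: "".join(to_text(v) for v in row), axis=1
)
print(df)
```

Note that dropping a missing value shortens the result (for example '24' instead of '240'), so rows with gaps can no longer be told apart by position alone.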

Reassigning Entries in a Column of Pandas DataFrame

My goal is to conditionally index a data frame and change the values in a column for the matching rows.
I intend to look through column 'A' for entries equal to 'a' and update column 'B' in those rows with the word 'okay'.
group = ['a']
df = pd.DataFrame({"A": ['a', 'b', 'a', 'a', 'c'], "B": [np.nan, np.nan, np.nan, np.nan, np.nan]})
>>>df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
df[df['A'].apply(lambda x: x in group)]['B'].fillna('okay', inplace=True)
This gives me the following error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._update_inplace(new_data)
Following the documentation (what I understood of it) I tried the following instead:
df[df['A'].apply(lambda x: x in group)].loc[:,'B'].fillna('okay', inplace=True)
I can't figure out why the reassignment of 'NaN' to 'okay' is not occurring inplace and how this can be rectified?
Thank you.
First solution (using a lambda):
>>> df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
Using a lambda with map or apply. Note that "a" in x is a substring test, while x == 'a' matches exactly:
>>> df["B"] = df["A"].map(lambda x: "okay" if "a" in x else np.nan)
>>> # or: df['B'] = df['A'].apply(lambda x: 'okay' if x == 'a' else np.nan)
>>> df
A B
0 a okay
1 b NaN
2 a okay
3 a okay
4 c NaN
Second solution:
>>> df
A B
0 a NaN
1 b NaN
2 a NaN
3 a NaN
4 c NaN
Another way is to create a dictionary (here called frame) and apply it to the column with map; values without a matching key become NaN:
>>> frame = {'a': "okay"}
>>> df['B'] = df['A'].map(frame)
>>> df
A B
0 a okay
1 b NaN
2 a okay
3 a okay
4 c NaN
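When the group holds more than one value, the lookup dictionary can be built from the list itself. This is a small sketch of that variation; the two-element group and the name lookup are assumptions made for illustration.

```python
import pandas as pd

group = ["a", "b"]  # hypothetical group with two members
df = pd.DataFrame({"A": ["a", "b", "a", "a", "c"]})

# Every member of the group maps to the same replacement text
lookup = dict.fromkeys(group, "okay")

# Values of A that are not keys of the dictionary map to NaN
df["B"] = df["A"].map(lookup)
print(df)
```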
Third solution:
This has already been posted by d_kennetz, but I include it here for completeness. It selects rows by column A and assigns to column B in one step:
>>> df.loc[df.A == 'a', 'B'] = "okay"
If I understand this correctly, you simply want to replace the value for a column on those rows matching a given condition (i.e. where A column belongs to a certain group, here with a single value 'a'). The following should do the trick:
import pandas as pd
group = ['a']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
print(df)
df.loc[df['A'].isin(group),'B'] = 'okay'
print(df)
What we're doing here is assigning through .loc, which writes directly into the existing dataframe instead of into a temporary copy.
The first argument (df['A'].isin(group)) selects the rows matching a given criterion. Notice that you can use the equality operator (==) but not the in operator, which is why .isin() is used instead.
Second argument selects only the 'B' column.
Then you just assign the desired value (which is a constant).
Here's the output:
A B
0 a None
1 b None
2 a None
3 a None
4 c None
A B
0 a okay
1 b None
2 a okay
3 a okay
4 c None
If you want to do something fancier, you can compute the assigned value from other columns:
import pandas as pd
group = ['a', 'b']
df = pd.DataFrame({"A": ['a','b','a','a','c'], "B": [None,None,None,None,None]})
df.loc[df['A'].isin(group),'B'] = "okay, it was " + df['A']+df['A']
print(df)
Which gives you:
A B
0 a okay, it was aa
1 b okay, it was bb
2 a okay, it was aa
3 a okay, it was aa
4 c None
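To illustrate why the chained form in the question leaves df unchanged while the .loc form works, here is a minimal sketch. It assumes an object-dtype B column (built from None values, as in the answer above) and uses illustrative names such as mask and filtered.

```python
import pandas as pd

group = ["a"]
df = pd.DataFrame({"A": ["a", "b", "a", "a", "c"], "B": [None] * 5})
mask = df["A"].isin(group)

# Boolean indexing produces a new DataFrame, so anything done to it
# (or to a column taken from it) does not reach the original df.
filtered = df[mask]
filled = filtered["B"].fillna("okay")
print(df["B"].isna().all())  # df itself is still entirely missing in B

# A single .loc call with a row selector and a column label assigns
# into df itself.
df.loc[mask, "B"] = "okay"
print(df)
```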

Python Pandas - Remove Duplicates with Inverse Values [duplicate]

I have a dataframe and want to eliminate duplicate rows, that have same values, but in different columns:
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df
Out[8]:
a b c d
1 x y e f
2 e f x y
3 w v s t
Rows [1],[2] have the values {x,y,e,f}, but they are arranged in a cross - i.e. if you would exchange columns c,d with a,b in row [2] you would have a duplicate.
I want to drop these lines and only keep one, to have the final output:
df_new
Out[20]:
a b c d
1 x y e f
3 w v s t
How can I efficiently achieve that?
I think you need to filter by boolean indexing, with a mask created by sorting each row with numpy.sort and calling duplicated on the result. Use ~ to invert the mask:
df = df[~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()]
print (df)
a b c d
1 x y e f
3 w v s t
Detail:
print (np.sort(df, axis=1))
[['e' 'f' 'x' 'y']
['e' 'f' 'x' 'y']
['s' 't' 'v' 'w']]
print (pd.DataFrame(np.sort(df, axis=1), index=df.index))
0 1 2 3
1 e f x y
2 e f x y
3 s t v w
print (pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1 False
2 True
3 False
dtype: bool
print (~pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated())
1 True
2 False
3 True
dtype: bool
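The steps above can be put together in one self-contained sketch. Sorting within each row makes the column order irrelevant, so rows holding the same values compare equal, and duplicated() then flags every occurrence after the first. The name row_keys is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [list("xyef"), list("efxy"), list("wvst")],
    columns=list("abcd"),
    index=["1", "2", "3"],
)

# Sort the values inside each row so that order no longer matters
row_keys = pd.DataFrame(np.sort(df.to_numpy(), axis=1), index=df.index)

# Keep only rows whose sorted values have not been seen before
df_new = df[~row_keys.duplicated()]
print(df_new)
```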
Here's another solution, with a for loop:
data = df.to_numpy()  # as_matrix() was removed in pandas 1.0
new = []
for row in data:
    # Keep the row only if none of its values appear in a row kept earlier
    if not any(c in nrow for nrow in new for c in row):
        new.append(row)
new_df = pd.DataFrame(new, columns=df.columns)
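Note that the loop above rejects a row as soon as it shares any single value with a kept row. A stricter variant, sketched here as an assumption about what is wanted, treats two rows as duplicates only when they contain exactly the same values in any order. The names seen and keep are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [list("xyef"), list("efxy"), list("wvst")],
    columns=list("abcd"),
    index=["1", "2", "3"],
)

seen = set()  # sorted value tuples of the rows kept so far
keep = []     # index labels of the rows to keep

for label, row in df.iterrows():
    key = tuple(sorted(row))  # order-independent fingerprint of the row
    if key not in seen:
        seen.add(key)
        keep.append(label)

df_new = df.loc[keep]
print(df_new)
```

Unlike the version that rebuilds the frame from arrays, selecting with df.loc keeps the original index labels.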
Sort the values within each row (np.sort), then flag the duplicates (.duplicated()).
Then use that boolean mask to drop (df.drop) the corresponding index labels:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a','b','c','d'], index=['1','2','3'])
df.loc['1'] = pd.Series({'a':'x','b':'y','c':'e','d':'f'})
df.loc['2'] = pd.Series({'a':'e','b':'f','c':'x','d':'y'})
df.loc['3'] = pd.Series({'a':'w','b':'v','c':'s','d':'t'})
df_duplicated = pd.DataFrame(np.sort(df, axis=1), index=df.index).duplicated()
# Index labels of the rows flagged as duplicates
index_to_drop = df.index[df_duplicated]
df_new = df.drop(index_to_drop)
