combining columns in pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan],
'user_b':['A','B',np.nan,'D']
})
I would like to create a new column called user and have the resulting dataframe:
What's the best way to do this for many users?

Forward-fill the missing values along each row, then select the last column with iloc:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan,np.nan],
'user_b':['A','B',np.nan,'D',np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
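
For the "many users" part of the question, here is a minimal sketch that assumes the columns to combine share the user_ prefix. The user_c and score columns are made up for illustration. It back-fills and takes the first column, so it returns the first non-null value per row, whereas ffill plus the last column returns the last one; the two only differ when the columns disagree.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['A', 'B', np.nan, 'D', np.nan],
    'user_c': [np.nan, 'B', np.nan, np.nan, np.nan],  # hypothetical extra user column
    'score': [1, 2, 3, 4, 5],                          # hypothetical unrelated column
})

# Restrict the fill to the user_* columns so unrelated columns are not mixed in.
user_cols = df.filter(like='user_')

# bfill along each row pulls the first non-null value into the first column.
df['user'] = user_cols.bfill(axis=1).iloc[:, 0]
print(df)
```

Selecting the columns first matters: running the fill over the whole frame would let values from unrelated columns leak into the result.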

Use the .apply method (note that this raises IndexError for a row in which every value is NaN):
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
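
If some rows can be entirely NaN, a variant along these lines may be safer. It is only a sketch; first_non_null is a hypothetical helper name, not something pandas provides.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['A', 'B', np.nan, 'D', np.nan],
})

def first_non_null(row):
    # Return the first non-missing value in the row, or NaN if there is none.
    return next((v for v in row if not pd.isna(v)), np.nan)

df['user'] = df.apply(first_non_null, axis=1)
print(df)
```

Row-wise apply is convenient but slow on large frames, so the fill-based approach is usually the better choice when there are many rows.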

Related

How can I return the value of a column in a new column based on conditions with python

I have a dataframe with three columns
a b c
[1,0,2]
[0,3,2]
[0,0,2]
and need to create a fourth column based on a hierarchy as follows:
If column a has value then column d = column a
if column a has no value but b has then column d = column b
if column a and b have no value but c has then column d = column c
a b c d
[1,0,2,1]
[0,3,2,3]
[0,0,2,2]
I'm quite the beginner at python and have no clue where to start.
Edit: I have tried the following but they all will not return a value in column d if column a is empty or None
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = df['a']
df.loc[df['a'] == None, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = np.where(df.a != 0, df.a,
                   np.where(df.b != 0, df.b, df.c))

A simple one-liner would be,
df['d'] = df.replace(0, np.nan).bfill(axis=1)['a'].astype(int)
Step by step visualization
Convert no value to NaN
a b c
0 1.0 NaN 2
1 NaN 3.0 2
2 NaN NaN 2
Now backward fill the values along rows
a b c
0 1.0 2.0 2.0
1 3.0 3.0 2.0
2 2.0 2.0 2.0
Now select the required column, i.e. 'a', and assign it to a new column 'd'
Output
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
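
The question's edit mentions that a can also be empty or None. A sketch of the same idea that treats both 0 and None as "no value" could look like the following; the sample data is made up, and the integer cast is left out so that a row with nothing in any column stays NaN instead of raising.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 0, None, None],
    'b': [0, 3, 0, None],
    'c': [2, 2, 2, 0],
})

# Treat 0 the same as None/NaN, i.e. as "no value".
masked = df[['a', 'b', 'c']].replace(0, np.nan)

# First non-missing value per row; stays NaN when every column is empty.
df['d'] = masked.bfill(axis=1).iloc[:, 0]
print(df)
```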

import numpy as np
import pandas as pd
df = pd.DataFrame([[1,0,2], [0,3,2], [0,0,2]], columns = ('a','b','c'))
print(df)
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
print(df)

Try this (df is your dataframe)
df['d'] = np.where(df.a.notna() & (df.a != 0), df.a, np.where(df.b.notna() & (df.b != 0), df.b, df.c))
>>> print(df)
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
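
When the hierarchy grows beyond two or three columns, nested np.where calls get hard to read. One possible alternative is np.select, which takes the first matching condition. This is a sketch, with has_a and has_b as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 0, 2], [0, 3, 2], [0, 0, 2]], columns=['a', 'b', 'c'])

# A column "has a value" when it is neither missing nor zero.
has_a = df['a'].notna() & df['a'].ne(0)
has_b = df['b'].notna() & df['b'].ne(0)

# np.select picks the choice for the first condition that is True;
# rows matching no condition fall back to the default (column c).
df['d'] = np.select([has_a, has_b], [df['a'], df['b']], default=df['c'])
print(df)
```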

How do I replace pandas rows with values of another dataframe for all instances of the value in the first df?

I have two dataframes:
df1=
A B C
a 1 3
b 2 3
c 2 2
a 1 4
df2=
A B C
a 1 3.5
Now I need to replace all occurrences of a in df1 (2 in this case) with a in df2, leaving b and c unchanged. The final dataframe should be:
df_final=
A B C
b 2 3
c 2 2
a 1 3.5

Do you mean:
df_final = pd.concat((df1[df1['A'].ne('a')], df2))
Or if you have several values like a:
list_special = ['a']
df_final = pd.concat((df1[~df1['A'].isin(list_special)], df2))
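
A self-contained sketch of that idea, rebuilding the frames from the question and taking the list of special values directly from df2['A'] (an assumption; a hand-written list works the same way):

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'a'], 'B': [1, 2, 2, 1], 'C': [3, 3, 2, 4]})
df2 = pd.DataFrame({'A': ['a'], 'B': [1], 'C': [3.5]})

# Drop every df1 row whose key appears in df2, then append the df2 rows.
keep = ~df1['A'].isin(df2['A'])
df_final = pd.concat([df1[keep], df2], ignore_index=True)
print(df_final)
```

Column C comes out as float because the integer values from df1 are combined with 3.5 from df2.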

If df2 just has the average of duplicated values, you can do df1.groupby(["A", "B"]).mean().reset_index()
Otherwise, you can do something like this:
In [27]: df = df1.groupby(["A", "B"]).first().merge(df2, how="left", on=["A", "B"])
...: df["C"] = df["C_y"].fillna(df["C_x"])
...: df = df[["A", "B", "C"]]
...: df
Out[27]:
A B C
0 a 1 3.5
1 b 2 3.0
2 c 2 2.0

Drop nan rows unless string value in separate column - Pandas

I want to drop rows containing NaN values except if a separate column contains a specific string. Using the df below, I want to drop rows if NaN in Code2, Code3 unless the string A is in Code1.
df = pd.DataFrame({
'Code1' : ['A','A','B','B','C','C'],
'Code2' : ['B',np.nan,'A','B',np.nan,'B'],
'Code3' : ['C',np.nan,'C','C',np.nan,'A'],
})
def dropna(df, col):
    if col == np.nan:
        df = df.dropna()
    return df
df = dropna(df, df['Code2'])
Intended Output:
Code1 Code2 Code3
0 A B C
1 A NaN NaN
2 B A C
3 B B C
4 C B A

Use DataFrame.notna + DataFrame.all to perform boolean indexing:
new_df=df[df.Code1.eq('A')|df.notna().all(axis=1)]
print(new_df)
Code1 Code2 Code3
0 A B C
1 A NaN NaN
2 B A C
3 B B C
5 C B A
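
If the frame has other columns whose NaN values should not trigger a drop, the check can be limited to Code2 and Code3. A minimal sketch, with is_exempt and is_complete as illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Code1': ['A', 'A', 'B', 'B', 'C', 'C'],
    'Code2': ['B', np.nan, 'A', 'B', np.nan, 'B'],
    'Code3': ['C', np.nan, 'C', 'C', np.nan, 'A'],
})

# Rows with 'A' in Code1 are always kept.
is_exempt = df['Code1'].eq('A')

# Only Code2 and Code3 are checked for missing values.
is_complete = df[['Code2', 'Code3']].notna().all(axis=1)

new_df = df[is_exempt | is_complete]
print(new_df)
```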

Appending duplicates as columns and removing the other rows

I have a df with some repeated IDs, like this:
index ID name surname
1 1 a x
2 2 b y
3 1 c z
4 3 d j
I'd like to append the columns of the repeated rows to the right and to remove the "single" rows, like this:
index ID name surname second_name second_surname
1 1 a x c z
What is the most efficient way to do it? (I have many millions of rows)

Try using drop_duplicates, merge and query like so:
df['second_name'] = (df.drop_duplicates(subset='ID')
.reset_index()
.merge(df, on='ID', how='inner', suffixes=('', '_'))
.query("name != name_")
.set_index('level_0')['name_'])
[out]
index ID name second_name
0 1 1 a c
1 2 2 b NaN
2 3 1 c NaN
3 4 3 d NaN
If you only need the single row, use dropna:
df.dropna(subset=['second_name'])
[out]
index ID name second_name
0 1 1 a c

My suggestion involves groupby and should work for an arbitrary number of "additional" names:
df_in = pd.DataFrame({'ID': [1, 2, 1, 3], 'name': ['a', 'b', 'c', 'd']})
grp = df_in.groupby('ID', as_index=True)
df_a = grp.first()
df_b = grp['name'].unique().apply(pd.Series).rename(columns = lambda x: 'name_{:.0f}'.format(x+1)).drop('name_1', axis=1)
df_out = df_a.merge(df_b, how='inner', left_index=True, right_index=True).reset_index(drop=False)

I would try to pivot the dataframe. For that, I will first add a rank column to give the rank of a name for its ID:
df['rank'] = df.groupby('ID').cumcount()
pivoted = df.pivot(index='ID', columns='rank', values='name')
giving:
rank 0 1
ID
1 a c
2 b NaN
3 d NaN
Let us just format it:
pivoted = pivoted.rename_axis(None, axis=1).rename(lambda x: 'name_{}'.format(x),
axis=1).reset_index()
ID name_0 name_1
0 1 a c
1 2 b NaN
2 3 d NaN
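
The question also asks for the surname and for the "single" rows to be removed. A sketch that extends the rank idea to both columns, assuming the column names from the question; the name_0/surname_1 naming scheme is just one option:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 1, 3],
    'name': ['a', 'b', 'c', 'd'],
    'surname': ['x', 'y', 'z', 'j'],
})

# Keep only IDs that occur more than once.
dups = df[df.duplicated('ID', keep=False)].copy()

# Rank each row within its ID: 0 for the first occurrence, 1 for the second, ...
dups['rank'] = dups.groupby('ID').cumcount()

# One row per ID, one column per (field, rank) pair.
wide = dups.set_index(['ID', 'rank'])[['name', 'surname']].unstack('rank')

# Flatten the MultiIndex columns into names such as name_0, surname_1.
wide.columns = [f'{field}_{rank}' for field, rank in wide.columns]
wide = wide.reset_index()
print(wide)
```

Filtering the duplicates before reshaping keeps the intermediate frame small, which may help with many millions of rows.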

Numpy / Pandas
r, i = np.unique(df.ID, return_inverse=True)
j = df.groupby('ID').cumcount()
names = np.empty((len(r), j.max() + 1), object)
names.fill(np.nan)
names[i, j] = df.name
pd.DataFrame(names, r).rename_axis('ID').add_prefix('name_')
name_0 name_1
ID
1 a c
2 b NaN
3 d NaN
Loopy
from itertools import count
from collections import defaultdict
c = defaultdict(count)
d = defaultdict(dict)
for i, n in zip(df.ID, df.name):
    d[f'name_{next(c[i])}'][i] = n
pd.DataFrame(d).rename_axis('ID')
name_0 name_1
ID
1 a c
2 b NaN
3 d NaN

Make sure Column B = a certain value when Column A is Null - Python

I want to make sure that when Column A is NULL (in csv), or NaN (in dataframe), Column B is "Cash".
I've tried this:
check = df[df['A'].isnull()]['B']
check = check.to_string(index=False)
if "Cash" not in check:
    print("Column A Fail")
else:
    print("Column A Pass!")
But it is not working.
any suggestions?
I also need to make sure that it doesn't treat '0' as NaN

UPDATE:
my goal is not to assign 'Cash', but rather to make sure that it's
already there as a quality check
In [40]: df
Out[40]:
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN Cash
In [41]: df.query("A != A and B != 'Cash'")
Out[41]:
A B
0 NaN a
or using boolean indexing:
In [42]: df.loc[df.A.isnull() & (df.B != 'Cash')]
Out[42]:
A B
0 NaN a
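
Put together as a pass/fail check, this might look like the sketch below. The sample data is made up and includes a 0 in column A to show that isnull() does not treat it as missing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [np.nan, 1.0, 0.0, np.nan],
    'B': ['a', 'b', 'c', 'Cash'],
})

# Rows that violate the rule: A is missing but B is not 'Cash'.
# isnull() is True only for NaN/None, so the 0 in A is not flagged.
violations = df[df['A'].isnull() & df['B'].ne('Cash')]

if violations.empty:
    print('Column A Pass!')
else:
    print('Column A Fail')
    print(violations)
```

Keeping the violating rows, rather than only a True/False result, makes it easier to see what needs fixing in the source file.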
OLD answer:
Alternative solution:
In [23]: df.B = np.where(df.A.isnull(), 'Cash', df.B)
In [24]: df
Out[24]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
another solution:
In [31]: df = df.mask(df.A.isnull(), df.assign(B='Cash'))
In [32]: df
Out[32]:
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash

Use loc to assign where A is null.
df.loc[df['A'].isnull(), 'B'] = 'Cash'
example
df = pd.DataFrame(dict(
A=[np.nan, 1, 2, np.nan],
B=['a', 'b', 'c', 'd']
))
print(df)
A B
0 NaN a
1 1.0 b
2 2.0 c
3 NaN d
Then do
df.loc[df['A'].isnull(), 'B'] = 'Cash'
print(df)
A B
0 NaN Cash
1 1.0 b
2 2.0 c
3 NaN Cash
To check that every B is 'Cash' where A is null:
(df.loc[df.A.isnull(), 'B'] == 'Cash').all()

By the rules of logic, P => Q is equivalent to (not P) or Q. So
(~df.A.isnull()|(df.B=="Cash")).all()
checks every row at once: it is True only if each row either has a non-null A or has B equal to "Cash".
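
Since the question starts from a CSV, here is a sketch of the same implication check applied right after loading. The CSV text is made up; it relies on pandas.read_csv treating the string NULL and empty fields as NaN by default, while 0 stays a regular number.

```python
import io
import pandas as pd

# Hypothetical CSV content: NULL and the empty field become NaN, 0 does not.
csv_text = "A,B\nNULL,Cash\n0,Card\n,Cash\n5,Cash\n"
df = pd.read_csv(io.StringIO(csv_text))

# P => Q written as (not P) or Q: A is present, or B is 'Cash'.
ok = bool((df['A'].notna() | df['B'].eq('Cash')).all())
print('Column A Pass!' if ok else 'Column A Fail')
```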
