Appending duplicates as columns and removing the other rows - python

I have a df with some repeated IDs, like this:
index  ID  name  surname
1      1   a     x
2      2   b     y
3      1   c     z
4      3   d     j
I'd like to append the columns of the repeated rows to the right and to remove the "single" rows, like this:
index  ID  name  surname  second_name  second_surname
1      1   a     x        c            z
What is the most efficient way to do it? (I have many millions of rows)

Try using drop_duplicates, merge and query like so:
# pair each first occurrence with its duplicate row, then align the
# duplicate's name back onto the first occurrence's original position
# ('level_0' comes from reset_index, since the frame already has an 'index' column)
df['second_name'] = (df.drop_duplicates(subset='ID')
                       .reset_index()
                       .merge(df, on='ID', how='inner', suffixes=('', '_'))
                       .query("name != name_")
                       .set_index('level_0')['name_'])
[out]
   index  ID name second_name
0      1   1    a           c
1      2   2    b         NaN
2      3   1    c         NaN
3      4   3    d         NaN
If you only need the combined row(s), drop the rest with dropna:
df.dropna(subset=['second_name'])
[out]
   index  ID name second_name
0      1   1    a           c
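The answer above only carries name across. Here is a sketch of my own extending the same idea to surname as well, assuming (as in the question) at most one duplicate per ID:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 3],
                   'name': ['a', 'b', 'c', 'd'],
                   'surname': ['x', 'y', 'z', 'j']})

first = df.drop_duplicates(subset='ID')                  # first occurrence of each ID
extra = (df[df.duplicated(subset='ID')]                  # the repeated rows
           .rename(columns={'name': 'second_name', 'surname': 'second_surname'}))

# inner merge keeps only the IDs that actually repeat
out = first.merge(extra, on='ID', how='inner')
print(out)
#    ID name surname second_name second_surname
# 0   1    a       x           c              z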

My suggestion involves groupby and should work for an arbitrary number of "additional" names:
import pandas as pd

df_in = pd.DataFrame({'ID': [1, 2, 1, 3], 'name': ['a', 'b', 'c', 'd']})
grp = df_in.groupby('ID', as_index=True)
df_a = grp.first()
df_b = (grp['name'].unique()
        .apply(pd.Series)                                   # one column per occurrence
        .rename(columns=lambda x: 'name_{:.0f}'.format(x + 1))
        .drop('name_1', axis=1))
df_out = df_a.merge(df_b, how='inner', left_index=True, right_index=True).reset_index(drop=False)
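For reference (my addition, not part of the original answer), on the toy frame this yields:

print(df_out)
#    ID name name_2
# 0   1    a      c
# 1   2    b    NaN
# 2   3    d    NaN

# to match the question and drop IDs that occur only once:
df_out.dropna(subset=['name_2'])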

I would try to pivot the dataframe. For that, I will first add a rank column giving the rank of each name within its ID:
df['rank'] = df.groupby('ID').cumcount()
pivoted = df.pivot(index='ID', columns='rank', values='name')
giving:

rank  0    1
ID
1     a    c
2     b  NaN
3     d  NaN
Let us just format it:

pivoted = (pivoted.rename_axis(None, axis=1)
                  .rename(lambda x: 'name_{}'.format(x), axis=1)
                  .reset_index())
   ID name_0 name_1
0   1      a      c
1   2      b    NaN
2   3      d    NaN
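To also fan out surname, the same rank trick works with set_index/unstack; this is a sketch of mine, and the flattened column names are my own choice:

df['rank'] = df.groupby('ID').cumcount()
wide = df.set_index(['ID', 'rank'])[['name', 'surname']].unstack('rank')
wide.columns = [f'{col}_{r}' for col, r in wide.columns]
wide = wide.dropna().reset_index()   # keep only IDs that repeat, per the question
print(wide)
#    ID name_0 name_1 surname_0 surname_1
# 0   1      a      c         x         z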

Numpy / Pandas
import numpy as np
import pandas as pd

# r: the unique IDs; i: each row's position in r; j: occurrence number within its ID
r, i = np.unique(df.ID, return_inverse=True)
j = df.groupby('ID').cumcount()
names = np.empty((len(r), j.max() + 1), object)
names.fill(np.nan)
names[i, j] = df.name
pd.DataFrame(names, r).rename_axis('ID').add_prefix('name_')
   name_0 name_1
ID
1       a      c
2       b    NaN
3       d    NaN
Loopy
from itertools import count
from collections import defaultdict

c = defaultdict(count)   # a fresh counter per ID
d = defaultdict(dict)
for i, n in zip(df.ID, df.name):
    d[f'name_{next(c[i])}'][i] = n

pd.DataFrame(d).rename_axis('ID')
   name_0 name_1
ID
1       a      c
2       b    NaN
3       d    NaN
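Both variants keep single-occurrence IDs as rows that are NaN beyond name_0; to match the question's desired output you can drop them afterwards (my addition):

result = pd.DataFrame(names, r).rename_axis('ID').add_prefix('name_')
result.dropna(subset=['name_1'])   # keep only IDs that occur at least twice
#    name_0 name_1
# ID
# 1       a      c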

Related

Python - lookup value in dataframe and return random corresponding value

df1 has a lot of NaN values.
I have compiled df2 with all unique values for code and name.
I need to replace the NaN code values in df1 with a random code value from df2 where df1 and df2 match on name.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(columns=['ID', 'name', 'code'])
df1.ID = [1, 2, 3, 4]
df1.name = ['A', 'A', 'B', 'B']
df1.code = [np.nan, np.nan, np.nan, np.nan]

df2 = pd.DataFrame(columns=['name', 'code'])
df2.name = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
df2.code = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
You could use random.sample and pass 2 after joining the values for each group into a list. Then merge back into the initial dataframe, explode the list, and drop_duplicates():
import random
import numpy as np

df2 = df2.groupby('name')['code'].apply(lambda x: random.sample(list(x), 2)).reset_index()
df3 = (df1[['ID', 'name']].merge(df2)
       .explode('code')
       .drop_duplicates(['name', 'code'])
       .reset_index(drop=True))
df3['ID'] = np.flatnonzero(df3['ID']) + 1   # renumber the surviving rows 1..n (all IDs are nonzero)
Out[1]:
   ID name code
0   1    A    d
1   2    A    a
2   3    B    h
3   4    B    f
You could create a dictionary where the keys are names and the values are the lists of possible codes, then for each name in df1 sample from the corresponding list:
import random

lookup = df2.groupby('name')['code'].apply(list).to_dict()
df1['code'] = df1['code'].fillna(
    pd.Series([random.choice(lookup[name]) for name in df1['name']],
              index=df1.index))
print(df1)
Output

   ID name code
0   1    A    b
1   2    A    b
2   3    B    g
3   4    B    g
If sampling without replacement is needed, you could do:
lst = [s for k, g in df1.groupby('name', as_index=False)
       for s in random.sample(lookup[k], len(g))]
df1['code'] = df1['code'].fillna(pd.Series(lst, index=df1.index))
print(df1)
Output

   ID name code
0   1    A    d
1   2    A    a
2   3    B    e
3   4    B    h
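A caveat of my own, not from the answers: the without-replacement list is built in groupby order and matched to df1.index positionally, which only lines up because the example's names are already sorted and contiguous. A sketch that aligns by index instead, so it also works on unsorted frames (seeding random for reproducibility):

import random
random.seed(0)  # reproducible draws; random.sample raises ValueError if a name
                # has more rows in df1 than codes available in df2

lookup = df2.groupby('name')['code'].apply(list).to_dict()
sampled = (df1.groupby('name', group_keys=False)['name']
              .apply(lambda g: pd.Series(random.sample(lookup[g.iloc[0]], len(g)),
                                         index=g.index)))
df1['code'] = df1['code'].fillna(sampled)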

Pandas new column from indexing list by row value

I am looking to create a new column in a Pandas data frame with the value of a list filtered by the df row value.
df = pd.DataFrame({'Index': [0, 1, 3, 2], 'OtherColumn': ['a', 'b', 'c', 'd']})

Index  OtherColumn
0      a
1      b
3      c
2      d
l = [1000, 1001, 1002, 1003]
Desired output:

Index  OtherColumn  Value
0      a            -
1      b            -
3      c            1003
2      d            -
My code:
df.loc[df.OtherColumn == 'c', 'Value'] = l[df.Index]
This raises an error, since df.Index is a whole column rather than a single int, and it has not been filtered by OtherColumn == 'c'.
For R users, I'm looking for:
df[OtherColumn == 'c', Value := l[Index]]
Thanks.
Convert the list to a numpy array for indexing, then filter by the mask on both sides:
m = df.OtherColumn == 'c'
df.loc[m, 'Value'] = np.array(l)[df.Index][m]
print(df)

   Index OtherColumn   Value
0      0           a     NaN
1      1           b     NaN
2      3           c  1003.0
3      2           d     NaN
Or use numpy.where:

m = df.OtherColumn == 'c'
df['Value'] = np.where(m, np.array(l)[df.Index], '-')
print(df)

   Index OtherColumn Value
0      0           a     -
1      1           b     -
2      3           c  1003
3      2           d     -
Or:
df['value'] = np.where(m, df['Index'].map(dict(enumerate(l))), '-')
Use Series.map + Series.where:

df['value'] = df['Index'].map(dict(enumerate(l))).where(df['OtherColumn'] == 'c', '-')
print(df)

   Index OtherColumn value
0      0           a     -
1      1           b     -
2      3           c  1003
3      2           d     -
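One more spelling of the same idea (a sketch of mine): treat l as a Series keyed by position, align it with reindex, and blank out the non-matching rows:

m = df['OtherColumn'] == 'c'
# reindex pulls l's values in the order given by the Index column;
# .to_numpy() drops the shuffled index so the assignment stays positional
df['Value'] = pd.Series(l).reindex(df['Index']).to_numpy()
df.loc[~m, 'Value'] = '-'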

pandas how to groupby and create other columns by counting values of existing columns

I know how to do this in R (How to make new columns by counting up an existing column), but I'd like to know how it works in Python as well.
When the original table is like below:

userID  cat1  cat2
a       f     3
a       f     3
a       u     1
a       m     1
b       u     2
b       m     1
b       m     2
I group them by userID and want it to come out like:

userID  cat1_f  cat1_m  cat1_u  cat2_1  cat2_2  cat2_3
a       2       1       1       2       0       2
b       0       2       1       1       2       0
Use melt with GroupBy.size and unstack:
df = (df.melt('userID')
        .groupby(['userID', 'variable', 'value'])
        .size()
        .unstack([1, 2], fill_value=0))

# python 3.6+
df.columns = [f'{a}_{b}' for a, b in df.columns]
# python below 3.6
# df.columns = ['{}_{}'.format(a, b) for a, b in df.columns]

df = df.reset_index()
print(df)
  userID  cat1_f  cat1_m  cat1_u  cat2_1  cat2_3  cat2_2
0      a       2       1       1       2       2       0
1      b       0       2       1       1       0       2
Alternative with crosstab:
df = df.melt('userID')
df = pd.crosstab(df['userID'], [df['variable'], df['value']])
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
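A third route (a sketch of mine, not from the answers above): one-hot encode both category columns with get_dummies and sum per user:

import pandas as pd

df = pd.DataFrame({'userID': list('aaaabbb'),
                   'cat1': ['f', 'f', 'u', 'm', 'u', 'm', 'm'],
                   'cat2': [3, 3, 1, 1, 2, 1, 2]})

out = (pd.get_dummies(df, columns=['cat1', 'cat2'])  # indicator columns cat1_f ... cat2_3
         .groupby('userID').sum()
         .reset_index())
print(out)

  userID  cat1_f  cat1_m  cat1_u  cat2_1  cat2_2  cat2_3
0      a       2       1       1       2       0       2
1      b       0       2       1       1       2       0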

combining columns in pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan],
    'user_b': ['A', 'B', np.nan, 'D']
})
I would like to create a new column called user that holds each row's single non-missing user value.
What's the best way to do this for many user columns?
Forward-fill missing values across columns and then select the last column with iloc:
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['A', 'B', np.nan, 'D', np.nan]
})

df['user'] = df.ffill(axis=1).iloc[:, -1]
print(df)

  user_a user_b user
0      A      A    A
1      B      B    B
2      C    NaN    C
3    NaN      D    D
4    NaN    NaN  NaN
Use the .apply method:
In [24]: df = pd.DataFrame({'user_a': ['A', 'B', 'C', np.nan], 'user_b': ['A', 'B', np.nan, 'D']})

In [25]: df
Out[25]:
  user_a user_b
0      A      A
1      B      B
2      C    NaN
3    NaN      D

In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)

In [27]: df
Out[27]:
  user_a user_b user
0      A      A    A
1      B      B    B
2      C    NaN    C
3    NaN      D    D
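A related one-liner (my sketch, not from the answers): back-fill across the columns and take the first one. Unlike the apply version, it also survives an all-NaN row, which would otherwise raise IndexError from [...][0]:

df['user'] = df.bfill(axis=1).iloc[:, 0]
# or, with exactly two columns:
df['user'] = df['user_a'].fillna(df['user_b'])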

Selecting groups formed by the groupby function

My dataframe:

df1

group  ordercode  quantity
0      A          1
       B          3
1      C          1
       E          2
       D          1
I have formed the groups with the groupby function. I need to extract the data by group number.
My desired output:

In: get group 0
out:
ordercode  quantity
A          1
B          3

or

group  ordercode  quantity
0      A          1
       B          3
Any suggestion would be appreciated.
Use DataFrame.xs; it is also possible to use the parameter drop_level=False:
# if you need to remove the original level
df1 = df.xs(0)
print(df1)

           quantity
ordercode
A                 1
B                 3

# if you want to keep the original level
df1 = df.xs(0, drop_level=False)
print(df1)

                 quantity
group ordercode
0     A                 1
      B                 3
EDIT: to filter group 0 from several dataframes at once:
dfs = [df1, df2, df3]
dfs = [x[x['group'] == 0] for x in dfs]
print (dfs)
In [131]: df.loc[pd.IndexSlice[0, :]]
Out[131]:
           quantity
ordercode
A                 1
B                 3

or

In [130]: df.loc[pd.IndexSlice[0, :], :]
Out[130]:
                 quantity
group ordercode
0.0   A                 1
      B                 3
You can use GroupBy.get_group after specifying columns. Here's a demo:

df = pd.DataFrame({'A': ['foo', 'bar'] * 3,
                   'B': np.random.rand(6),
                   'C': np.arange(6)})

gb = df.groupby('A')
print(gb[gb.obj.columns].get_group('bar'))

     A         B  C
1  bar  0.523248  1
3  bar  0.575946  3
5  bar  0.318569  5
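For the MultiIndexed frame in the question itself, the same get_group idea works by grouping on the index level (a sketch of mine; assumes the level is named 'group'):

print(df.groupby(level='group').get_group(0))

                 quantity
group ordercode
0     A                 1
      B                 3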
