How can I create a column using the first 2 letters from other columns, excluding NaN values? E.g. I have 3 columns:
a=pd.Series(['Eyes', 'Ear', 'Hair', 'Skin'])
b=pd.Series(['Hair', 'Liver', 'Eyes', 'NaN'])
c=pd.Series(['NaN', 'Skin', 'NaN', 'NaN'])
df=pd.concat([a, b, c], axis=1)
df.columns=['First', 'Second', 'Third']
Now I want to create a 4th column that combines the first 2 letters from 'First', 'Second' and 'Third' after sorting (so that Ear comes before Hair irrespective of the column), skipping NaN values.
The final output for the fourth column would look something like:
Fourth = pd.Series(['EyHa', 'EaLiSk', 'EyHa', 'Sk'])
If NaN is np.nan (a real missing value):
a=pd.Series(['Eyes', 'Ear', 'Hair', 'Skin'])
b=pd.Series(['Hair', 'Liver', 'Eyes', np.nan])
c=pd.Series([np.nan, 'Skin', np.nan, np.nan])
df=pd.concat([a, b, c], axis=1)
df.columns=['First', 'Second', 'Third']
df['new'] = df.apply(lambda x: ''.join(sorted([y[:2] for y in x if pd.notnull(y)])), axis=1)
Another solution:
# fillna('') turns NaN into empty strings; they sort first and contribute
# nothing to the join, so missing values are effectively skipped
df['new'] = [''.join([y[:2] for y in x]) for x in np.sort(df.fillna('').values, axis=1)]
# alternative:
# df['new'] = [''.join(sorted([y[:2] for y in x if pd.notnull(y)])) for x in df.values]
print (df)
First Second Third new
0 Eyes Hair NaN EyHa
1 Ear Liver Skin EaLiSk
2 Hair Eyes NaN EyHa
3 Skin NaN NaN Sk
If NaN is the string 'NaN':
df['new'] = df.apply(lambda x: ''.join(sorted([y[:2] for y in x if y != 'NaN'])), axis=1)
df['new'] = [''.join(sorted([y[:2] for y in x if y != 'NaN'])) for x in df.values]
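For completeness, a minimal end-to-end sketch of the string case, reusing the construction from the question:
import pandas as pd

# 'NaN' here is a literal string, not a missing value
a = pd.Series(['Eyes', 'Ear', 'Hair', 'Skin'])
b = pd.Series(['Hair', 'Liver', 'Eyes', 'NaN'])
c = pd.Series(['NaN', 'Skin', 'NaN', 'NaN'])
df = pd.concat([a, b, c], axis=1)
df.columns = ['First', 'Second', 'Third']

# skip the 'NaN' strings, take 2-letter prefixes, sort, join
df['new'] = [''.join(sorted(y[:2] for y in x if y != 'NaN')) for x in df.values]
print(df['new'])
# 0      EyHa
# 1    EaLiSk
# 2      EyHa
# 3        Sk
# Name: new, dtype: object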
Related
PROBLEM STATEMENT:
I'm trying to join multiple pandas data frame columns, based on row index, to a single column already in the data frame. Issues seem to happen when the data in a column is read in as np.nan.
EXAMPLE:
Original DataFrame:
   time   msg   d0   d1   d2
0     0  msg0    a    b    c
1     1  msg1    x    x    x
2     2  msg0    a    b    c
3     3  msg2    1    2    3
What I want, if I were to filter for msg0 and msg2:
   time   msg   d0   d1   d2
0     0  msg0  abc  NaN  NaN
1     1  msg1    x    x    x
2     2  msg0  abc  NaN  NaN
3     3  msg2  123  NaN  NaN
MY ATTEMPT:
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})
mask = df.index[((df['msg'] == "msg0") |
                 (df['msg'] == "msg1") |
                 (df['msg'] == "msg3"))].tolist()
# Is there a better way to combine all columns after a certain point?
# This works fine here but has issues when importing large data sets.
# The 'd0' will be set to NaN too; I think this is due to np.nan
# being set to some column values when imported.
df.loc[mask, 'd0'] = df['d0'] + df['d1'] + df['d2']
df.iloc[mask, 3:] = "NaN"
The approach is somewhat similar to @mozway's answer; I will make it more detailed so it is easier to follow.
1- Define your target columns and messages (just to make it easier to deal with)
# the messages to filter
msgs = ["msg0", "msg2"]
# the columns to filter
columns = df.columns.drop(['time', 'msg'])
# the column to contain the result
total_col = ["d0"]
2- Mask the rows based on the (msgs) column value
mask = df['msg'].isin(msgs)
3- Compute the combined values
# a- mask the dataframe to the target columns and rows
# b- apply ''.join() to combine the column values
# c- apply on axis=1 to join across columns, not down rows
new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)
4- Set all target columns and rows to np.nan and redefine the values of the "total" column
df.loc[mask, columns] = np.nan
df.loc[mask, total_col] = new_total_col
Result
time msg d0 d1 d2
0 0 msg0 abc NaN NaN
1 1 msg1 x x x
2 2 msg0 ab NaN NaN
3 3 msg2 123 NaN NaN
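Putting the four steps together, a minimal end-to-end sketch using the example frame from MY ATTEMPT (it reproduces the Result above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'time': ['0', '1', '2', '3'],
                   'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
                   'd0': ['a', 'x', 'a', '1'],
                   'd1': ['b', 'x', 'b', '2'],
                   'd2': ['c', 'x', np.nan, '3']})

msgs = ["msg0", "msg2"]
columns = df.columns.drop(['time', 'msg'])
mask = df['msg'].isin(msgs)

# join the non-NaN values of each masked row into one string
new_total_col = df.loc[mask, columns].apply(
    lambda x: ''.join(x.dropna().astype(str)), axis=1)

# blank out the data columns, then write the combined string into d0
df.loc[mask, columns] = np.nan
df.loc[mask, 'd0'] = new_total_col
print(df)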
You can use:
cols = ['d0', 'd1', 'd2']
# get the rows matching the msg condition
m = df['msg'].isin(['msg0', 'msg2'])
# get relevant columns
# concatenate the non-NaN values
# update as a DataFrame: assigning a one-column frame
# fills the non-matching columns with NaN
df.loc[m, cols] = (df
.loc[m, cols]
.agg(lambda r: ''.join(r.dropna()), axis=1)
.rename(cols[0]).to_frame()
)
print(df)
Output:
time msg d0 d1 d2
0 0 msg0 abc NaN NaN
1 1 msg1 x x x
2 2 msg0 ab NaN NaN
3 3 msg2 123 NaN NaN
I have a dataframe like the following, and I want to group the answers like the following. I tried to use a MultiIndex, but it won't work.
You can use pandas.MultiIndex.from_arrays to manually craft your custom index:
new_level = ['GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2', 'GROUP3', 'GROUP3']
df.columns = pd.MultiIndex.from_arrays([new_level, df.columns])
example input:
A B C D E
0 X X X X X
output:
>>> df.columns = pd.MultiIndex.from_arrays([[1,1,2,2,3], df.columns])
>>> df
1 2 3
A B C D E
0 X X X X X
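Once the columns carry a MultiIndex, selecting by the first level pulls out a whole group at once. A small sketch assuming the seven GROUP labels from above (the single-row frame and column letters are just illustrative):
import pandas as pd

df = pd.DataFrame([['X'] * 7], columns=list('ABCDEFG'))
new_level = ['GROUP1', 'GROUP1', 'GROUP1', 'GROUP2', 'GROUP2', 'GROUP3', 'GROUP3']
df.columns = pd.MultiIndex.from_arrays([new_level, df.columns])

# select all columns belonging to GROUP1 in one go
print(df['GROUP1'])
#    A  B  C
# 0  X  X  X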
Let's say that I have the following pd.DataFrame.
import pandas as pd
import numpy as np
data = {'number': [1, 1, 1, 2], 'q':[np.nan, 2, np.nan, 1], 'letter': ['alpha', 'beta', 'gamma', 'alpha']}
df = pd.DataFrame(data)
number q letter
0 1 NaN alpha
1 1 2.0 beta
2 1 NaN gamma
3 2 1.0 alpha
What I want to do is aggregate by number, create a list with all the letters, and apply a filter based on the value of q.
If I do this:
df.groupby('number').agg({"letter": lambda w: list(w)})
it will yield:
letter
number
1 [alpha, beta, gamma]
2 [alpha]
But I want to include only the letters whose corresponding q value is not NaN, i.e.
number letter
0 1 [beta]
1 2 [alpha]
Edit: I would appreciate a more generic solution: not just handling NaN values, but one where we can specify a threshold on q that decides what is included.
I think you need DataFrame.dropna:
df1 = df.dropna().groupby('number').agg({"letter": lambda w: list(w)})
If you want to specify which column is checked for missing values (so NaN elsewhere is ignored):
df1 = df.dropna(subset=['q']).groupby('number').agg({"letter": lambda w: list(w)})
print (df1)
letter
number
1 [beta]
2 [alpha]
EDIT:
You can also filter with query:
df1 = df.query("q > 0").groupby('number').agg({"letter": lambda w: list(w)})
Or boolean indexing:
df1 = df[df['q'] > 0].groupby('number').agg({"letter": lambda w: list(w)})
df1 = df[df['q'].notnull()].groupby('number').agg({"letter": lambda w: list(w)})
EDIT1:
Filtering is also possible inside the function, to avoid losing non-matched groups:
def f(x):
    return x.loc[x['q'] > 1, 'letter'].tolist()
df2 = df.groupby('number').apply(f).reset_index(name='val')
print (df2)
number val
0 1 [beta]
1 2 []
df1 = df[df['q'] > 1].groupby('number').agg({"letter": lambda w: list(w)})
print (df1)
letter
number
1 [beta]
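To address the edit asking for a configurable threshold directly, the boolean-indexing variant wraps naturally in a small helper (the function name agg_letters is mine):
def agg_letters(df, threshold):
    # keep rows whose q exceeds the threshold, then collect letters per number
    return df[df['q'] > threshold].groupby('number').agg({"letter": lambda w: list(w)})

print(agg_letters(df, 0))   # same as dropping NaN, since NaN > 0 is False
print(agg_letters(df, 1))   # stricter: only letters with q above 1 survive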
I am working with a pandas DataFrame looking as follows:
df = pd.DataFrame(
[['There are # people', '3', np.nan], ['# out of # people are there', 'Five', 'eight'],
['Only # are here', '2', np.nan], ['The rest is at home', np.nan, np.nan]])
resulting in:
0 1 2
0 There are # people 3 NaN
1 # out of # people are there Five eight
2 Only # are here 2 NaN
3 The rest is at home NaN NaN
I would like to replace the # placeholders with the varying strings in columns 1 and 2, resulting in:
0 There are 3 people
1 Five out of eight people are there
2 Only 2 are here
3 The rest is at home
How could I achieve this?
Using string formatting:
df = df.replace({'#': '%s', np.nan: 'NaN'}, regex=True)
l = []
for x, y in df.iterrows():
    # note: %-formatting assumes the text contains no literal '%' characters
    if y[2] == 'NaN' and y[1] == 'NaN':
        l.append(y[0])
    elif y[2] == 'NaN':
        l.append(y[0] % (y[1]))
    else:
        l.append(y[0] % (y[1], y[2]))
l
Out[339]:
['There are 3 people',
'Five out of eight people are there',
'Only 2 are here',
'The rest is at home']
A more concise way to do it (pd.notna is used for the check, since x != np.NaN is always True because NaN never compares equal to itself):
cols = df.columns
df[cols[0]] = df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[1]]), 1) if pd.notna(x[cols[1]]) else x[cols[0]], axis=1)
print(df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[2]]), 1) if pd.notna(x[cols[2]]) else x[cols[0]], axis=1))
Out[12]:
0 There are 3 people
1 Five out of eight people are there
2 Only 2 are here
3 The rest is at home
Name: 0, dtype: object
If you need to do this for even more columns:
cols = df.columns
for i in range(1, len(cols)):
    df[cols[0]] = df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[i]]), 1) if pd.notna(x[cols[i]]) else x[cols[0]], axis=1)
print(df[cols[0]])
A generic replace function in case you may have more values to add:
Replaces all instances of a given character in a string using a list of values (just two in your case, but it can handle more).
def replace_hastag(text, values, replace_char='#'):
    for v in values:
        if v is np.NaN:
            # stop at the first missing value (assumes NaNs come after the real values)
            return text
        else:
            text = text.replace(replace_char, str(v), 1)
    return text
df['text'] = df.apply(lambda r: replace_hastag(r[0], values=[r[1], r[2]]), axis=1)
Result
In [79]: df.text
Out[79]:
0 There are 3 people
1 Five out of eight people are there
2 Only 2 are here
3 The rest is at home
Name: text, dtype: object
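If more placeholder columns show up later, the same helper generalizes by passing the rest of the row; a sketch (the positional iloc access is my choice):
df['text'] = df.apply(lambda r: replace_hastag(r.iloc[0], values=r.iloc[1:].tolist()), axis=1)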
I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop weird index
print(y)
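For reference, this should print the following (exact spacing can differ across pandas versions):
  label  column1  column2
0     a        1        2
1     a        2        1
2     b        6        4
3     b        4        6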
You can try the code block below:
# create the DataFrame
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})
# group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
# concat those groups to create column2
df2 = (pd.concat([b, a])
       .sort_values(by='label')
       .rename(columns={'column1': 'column2'})
       .reset_index()
       .drop('index', axis=1))
# merge with the original DataFrame
df = df.merge(df2, left_index=True, right_index=True, on='label')[['label', 'column1', 'column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# create the DataFrame
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})
# iterate over the DataFrame, find the row with the same label and the other value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # write the value into the new column (df.set_value was removed in pandas 1.0; use .at)
    df.at[index, 'column2'] = newvalue
df.head()
You can use groupby with apply to create a new Series with the values in reverse order:
df['column2'] = df.groupby('label')["column1"] \
    .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6
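As a closing variant (my addition, not one of the original answers): groupby with transform returns a result already aligned to the original index, so the reset_index step disappears:
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.values[::-1])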