Starting with this dataframe df:
df = pd.DataFrame({'id':[1,2,3,4],'a':['on','on','off','off'], 'b':['on','off','on','off']})
a b id
0 on on 1
1 on off 2
2 off on 3
3 off off 4
what I would like to achieve is a result column holding the names of the columns selected by their 'on'/'off' values. Expected output is:
a b id result
0 on on 1 [a,b]
1 on off 2 [a]
2 off on 3 [b]
3 off off 4 []
so basically I have to select the 'on' values in the columns (except id) and then keep the resulting column names in lists. My first attempt was using pivot_table:
d = pd.pivot_table(df, index='id', columns=?, values=?)
but I am stuck on how to put the selection into the values and the new column into the columns args.
What works for me is to create nested lists and then select the first value of each list with str[0]:
df['res'] = df[['a','b']].eq('on').apply(lambda x: [x.index.values[x]], axis=1).str[0]
print (df)
a b id res
0 on on 1 [a, b]
1 on off 2 [a]
2 off on 3 [b]
3 off off 4 []
Or create tuples first and then cast them to lists:
df['res'] = (df[['a','b']].eq('on')
               .apply(lambda x: tuple(x.index.values[x]), axis=1).apply(list))
print (df)
a b id res
0 on on 1 [a, b]
1 on off 2 [a]
2 off on 3 [b]
3 off off 4 []
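As a side note, a plain list comprehension over the boolean mask gives the same lists without apply (a minimal sketch, not from the original answer):
cols = df[['a','b']].columns
# pick the matching column names for each row of the mask
df['res'] = [list(cols[m]) for m in df[['a','b']].eq('on').values]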
Instead of a pivot table you can also use:
df['result'] = df.iloc[:,0:2].eq('on').apply(lambda x: tuple(df.columns[0:2][x]), axis=1)
Output:
a b id result
0 on on 1 (a, b)
1 on off 2 (a,)
2 off on 3 (b,)
3 off off 4 ()
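If you prefer lists over tuples here, a final cast is enough (a small add-on to the answer above):
df['result'] = df['result'].apply(list)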
Or you can use eq and mul:
df['res'] = (df[['a','b']].eq('on').mul(['a','b'])).values.tolist()
Out[824]:
a b id res
0 on on 1 [a, b]
1 on off 2 [a, ]
2 off on 3 [, b]
3 off off 4 [, ]
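The mul approach leaves empty strings in the 'off' positions; if you need them removed, a follow-up filter works (a small add-on, not part of the original answer):
# keep only the non-empty column names in each list
df['res'] = df['res'].apply(lambda lst: [v for v in lst if v])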
Try this:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4],'a':['on','on','off','off'], 'b':['on','off','on','off']})
stringList = []
for i in range(0, df.shape[0]):
    if df['a'][i] == 'on' and df['b'][i] == 'on':
        stringList.append('[a,b]')
    elif df['a'][i] == 'on' and df['b'][i] == 'off':
        stringList.append('[a]')
    elif df['a'][i] == 'off' and df['b'][i] == 'on':
        stringList.append('[b]')
    else:
        stringList.append('[]')
df['result'] = stringList
print(df)
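Note that this loop stores string literals like '[a,b]', not Python lists. A variant of the same loop that builds real lists could look like this (a sketch under the same assumptions):
stringList = []
for i in range(0, df.shape[0]):
    # collect the column names whose value is 'on' in this row
    stringList.append([c for c in ('a', 'b') if df[c][i] == 'on'])
df['result'] = stringList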
I'm aiming to return rows in a pandas df that contain two specific values, grouped by a separate column. Using the code below, I'm grouping by Num and aiming to return rows where B is present but not A for each unique group.
If neither A nor B is assigned to a grouped value then continue. I only want to return the rows where B is present but not A.
import pandas as pd
df = pd.DataFrame({
'Num' : [1,1,2,2,2,2,3,3,4,4,4,4],
'Label' : ['X','Y','X','B','B','B','A','B','B','A','B','X'],
})
df = df.loc[(df['Label'] == 'A') | (df['Label'] == 'B')]
df = df.groupby('Num').filter(lambda x: any(x['Label'] == 'A'))
df = df.groupby('Num').filter(lambda x: any(x['Label'] == 'B'))
intended output:
Num Label
2 2 X
3 2 B
4 2 B
5 2 B
You can filter groups where all values are B by using GroupBy.transform with all:
df1 = df.loc[(df['Label'] == 'A') | (df['Label'] == 'B')]
df1 = df1[(df1['Label'] == 'B').groupby(df1['Num']).transform('all')]
print (df1)
Num Label
3 2 B
4 2 B
5 2 B
If you need to filter the original DataFrame by the matching Num values, use:
df = df[df['Num'].isin(df1['Num'])]
print (df)
Num Label
2 2 X
3 2 B
4 2 B
5 2 B
Another approach is to filter with numpy.setdiff1d:
import numpy as np

num = np.setdiff1d(df.loc[(df['Label'] == 'B'), 'Num'],
                   df.loc[(df['Label'] == 'A'), 'Num'])
df = df[df['Num'].isin(num)]
print (df)
Num Label
2 2 X
3 2 B
4 2 B
5 2 B
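For completeness, starting again from the original df, plain Python sets express the same difference, assuming the Num values are hashable (a sketch, not from the original answer):
num = set(df.loc[df['Label'] == 'B', 'Num']) - set(df.loc[df['Label'] == 'A', 'Num'])
df = df[df['Num'].isin(num)]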
I have two pandas DataFrames, and I want to left join A to B on two conditions: 1) A.id = B.id and 2) A.x in B.x_set. Can anyone help me with this?
A:
id x
1 a
2 b
3 c
B:
id x_set detail
1 a,b,c x
1 d y
2 a,c z
2 d m
2 b n
3 a i
3 b,c j
The final table should be like this:
id x detail
1 a x
2 b n
3 c j
If using pandas>=0.25, you can:
Transform the values to list
Explode the list into new rows
Merge back with A using pd.merge
B['x_set'] = B['x_set'].apply(lambda x: x.split(','))
B = B.explode('x_set')
A.merge(B, left_on=['id','x'], right_on=['id','x_set'])
Out[11]:
id x x_set detail
0 1 a a x
1 2 b b n
2 3 c c j
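As a side note, on the original B the apply/split above can also be written with str.split, and the helper column can be dropped after the merge (same assumptions as above):
B['x_set'] = B['x_set'].str.split(',')   # equivalent to the apply(lambda x: x.split(',')) call
A.merge(B.explode('x_set'), left_on=['id','x'], right_on=['id','x_set']).drop('x_set', axis=1)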
If pandas<0.25:
Transform the values to list
Get a flattened list of x values
Create a new dataframe with the new list
Pass the id and detail using pd.Series.repeat
Merge with A (we can use the same keys here)
B['x_set'] = B['x_set'].apply(lambda x: x.split(','))
len_set = B['x_set'].apply(len).values
values = B['x_set'].values.flatten().tolist()
flat_results = [item for sublist in values for item in sublist]
new_B = pd.DataFrame(flat_results, columns=['x'])
new_B['id'] = B['id'].repeat(len_set).values
new_B['detail'] = B['detail'].repeat(len_set).values
A.merge(new_B, on=['id','x'])
Out[32]:
id x detail
0 1 a x
1 2 b n
2 3 c j
I have this dataframe:
dfx = pd.DataFrame([[1,2],['A','B'],[['C','D'],'E']],columns=list('AB'))
A B
0 1 2
1 A B
2 [C, D] E
... that I want to transform into ...
A B
0 1 2
1 A B
2 C E
3 D E
... adding a row for each value contained in column A if it's a list.
Which is the most pythonic way?
And vice versa: what if I want to group by a column (let's say B) and have in column A a list of the grouped values (so the opposite of the example above)?
Thanks in advance,
Gianluca
You have a mixed DataFrame - int, str and list values together (very problematic, because many functions raise errors on it), so first convert all numeric values to str. The mask comes from to_numeric with parameter errors='coerce', which converts non-numeric values to NaN, and where keeps the original value wherever that conversion failed:
dfx.A = dfx.A.where(pd.to_numeric(dfx.A, errors='coerce').isnull(), dfx.A.astype(str))
print (dfx)
A B
0 1 2
1 A B
2 [C, D] E
and then create a new DataFrame with np.repeat and flatten the list values with chain.from_iterable:
import numpy as np
from itertools import chain

df = pd.DataFrame({
    "B": np.repeat(dfx.B.values, dfx.A.str.len()),
    "A": list(chain.from_iterable(dfx.A))})
print (df)
A B
0 1 2
1 A B
2 C E
3 D E
A pure pandas solution: convert column A to a list and create a new DataFrame with DataFrame.from_records. Then drop the original column A and join the stacked DataFrame:
df = pd.DataFrame.from_records(dfx.A.values.tolist(), index = dfx.index)
df = dfx.drop('A', axis=1).join(df.stack().rename('A')
          .reset_index(level=1, drop=True))[['A','B']]
print (df)
A B
0 1 2
1 A B
2 C E
2 D E
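If you prefer a clean 0..n index instead of the duplicated 2, a reset_index at the end is enough (a small add-on, not part of the original answer):
df = df.reset_index(drop=True)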
If you need lists, use groupby and apply tolist:
print (df.groupby('B')['A'].apply(lambda x: x.tolist()).reset_index())
B A
0 2 [1]
1 B [A]
2 E [C, D]
But if you need a list only when there is more than one value, an if..else is necessary:
print (df.groupby('B')['A'].apply(lambda x: x.tolist() if len(x) > 1 else x.values[0])
.reset_index())
B A
0 2 1
1 B A
2 E [C, D]
I have two dataframes; (a,b,c,d) and (i,j,k) are their column names:
df1 =
a b c d
0 1 2 3
0 1 2 3
0 1 2 3
df2 =
i j k
0 1 2
0 1 2
0 1 2
I want to select the entries of df1 whose values also appear in df2. I want to obtain:
df1=
a b c
0 1 2
0 1 2
0 1 2
You can use isin to compare df1 with each column of df2:
dfs = []
for i in range(len(df2.columns)):
    df = df1.isin(df2.iloc[:,i])
    dfs.append(df)
Then concatenate all the masks together and reduce them per cell with any:
mask = pd.concat(dfs).groupby(level=0).any()
print (mask)
a b c d
0 True True True False
1 True True True False
2 True True True False
Apply boolean indexing:
print (df1.loc[:, mask.all()])
a b c
0 0 1 2
1 0 1 2
2 0 1 2
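For reference, the same idea can be written more compactly with a list comprehension and any, assuming the row counts of df1 and df2 match (a sketch, not from the original answer):
mask = pd.concat([df1.isin(df2[c]) for c in df2.columns]).groupby(level=0).any()
print (df1.loc[:, mask.all()])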
Doing a column-wise comparison would give the desired result:
df1 = df1[(df1.a == df2.i) & (df1.b == df2.j) & (df1.c == df2.k)][['a','b','c']]
You get only those rows from df1 where the values of the first three columns are identical to those of df2.
Then you just select the columns 'a','b','c' from df1.
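If you'd rather not spell out each column, comparing the underlying value blocks also works, assuming both frames have the same number of rows in matching order (a minimal sketch):
sub = df1[['a','b','c']]
# keep the rows where all three values equal the corresponding df2 row
df1 = sub[(sub.values == df2.values).all(axis=1)]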
I want to group my data set and enrich it with a formatted representation of the aggregated information.
This is my data set:
h = ['A', 'B', 'C']
d = [["a", "x", 1], ["a", "y", 2], ["b", "y", 4]]
rows = pd.DataFrame(d, columns=h)
A B C
0 a x 1
1 a y 2
2 b y 4
I create a pivot table to generate 0 for missing values:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
C
B x y
A
a 1 2
b 0 4
I group by A to remove dimension B:
wanted = rows.groupby("A").sum()
C
A
a 3
b 4
I try to add a column with the string representation of the aggregate details:
wanted["D"] = pivot["C"].applymap(lambda vs: reduce(lambda a,b: str(a)+"+"+str(b), vs.values))
AttributeError: ("'int' object has no attribute 'values'", u'occurred at index x')
It seems that I don't understand applymap.
What I want to achieve is:
C D
A
a 3 1+2
b 4 0+4
You can first remove the [] around the parameters in pivot_table, which removes the MultiIndex from the columns:
pivot = pd.pivot_table(rows,index="A", values="C", columns="B",fill_value=0)
Then sum the values across the columns for each row:
pivot['C'] = pivot.sum(axis=1)
print (pivot)
B x y C
A
a 1 2 3
b 0 4 4
Cast the int columns x and y to str with astype and combine them into D:
pivot['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (pivot)
B x y C D
A
a 1 2 3 1+2
b 0 4 4 0+4
Last, remove the columns' axis name with rename_axis (new in pandas 0.18.0) and drop the unnecessary columns:
pivot = pivot.rename_axis(None, axis=1).drop(['x', 'y'], axis=1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
But if you want a MultiIndex in the columns:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
pivot['E'] = pivot["C"].sum(1)
print (pivot)
C E
B x y
A
a 1 2 3
b 0 4 4
pivot["D"] = pivot[('C','x')].astype(str) + '+' + pivot[('C','y')].astype(str)
print (pivot)
C E D
B x y
A
a 1 2 3 1+2
b 0 4 4 0+4
pivot = pivot.rename_axis((None,None), axis=1).drop('C', axis=1).rename(columns={'E':'C'})
pivot.columns = pivot.columns.droplevel(-1)
print (pivot)
C D
A
a 3 1+2
b 4 0+4
EDIT:
Another solution with groupby and MultiIndex.droplevel:
pivot = pd.pivot_table(rows,index=["A"], values=["C"], columns=["B"],fill_value=0)
#remove the top level of the MultiIndex in columns
pivot.columns = pivot.columns.droplevel(0)
print (pivot)
B x y
A
a 1 2
b 0 4
wanted = rows.groupby("A").sum()
wanted['D'] = pivot['x'].astype(str) + '+' + pivot['y'].astype(str)
print (wanted)
C D
A
a 3 1+2
b 4 0+4
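For reference, the whole EDIT variant can be condensed into one construction that also generalizes to more than two B categories (a sketch, assuming the same rows DataFrame as above):
pivot = pd.pivot_table(rows, index='A', values='C', columns='B', fill_value=0)
wanted = pd.DataFrame({
    'C': pivot.sum(axis=1),                         # total per group
    'D': pivot.astype(str).apply('+'.join, axis=1), # e.g. '1+2', '0+4'
})
print (wanted)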