Splitting Pandas column into multiple columns - python

Is there a way in Pandas to split a column into multiple columns?
I have a column in a dataframe whose contents are as follows:
a
[c,a]
b
I would like to split this into:
colA colB colC
a    nan  nan
a    nan  c
nan  b    nan
Please note the order of the variables in the 2nd row of the original column.
Thanks

Consider the series s
s = pd.Series(['a', ['c', 'a'], 'b'])
s
0 a
1 [c, a]
2 b
dtype: object
Use pd.Series and '|'.join to turn each row into a pipe-separated string. Use str.get_dummies to get an array of zeros and ones. Multiply that by the column names to replace the ones with the column values; where then masks the zeros and replaces them with np.NaN.
d1 = s.apply(lambda x: '|'.join(pd.Series(x))).str.get_dummies()
d1.mul(d1.columns.values).where(d1.astype(bool))
a b c
0 a NaN NaN
1 a NaN c
2 NaN b NaN
PROJECT/KILL (a NumPy alternative)
import itertools
n = len(s)
# repeat each row index once per element in that row (list rows contribute several)
i = np.arange(n).repeat([len(x) if hasattr(x, '__len__') else 1 for x in s])
# factorize the flattened values: j holds integer codes, u the unique values
j, u = pd.factorize(list(itertools.chain(*s)))
m = u.size
# count (row, value) pairs and reshape into an n x m indicator matrix
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
pd.DataFrame(np.where(b, u, np.NaN), columns=u)
a b c
0 a NaN NaN
1 a NaN c
2 NaN b NaN
Timing
%%timeit
d1 = s.apply(lambda x: '|'.join(pd.Series(x))).str.get_dummies()
d1.mul(d1.columns.values).where(d1.astype(bool))
100 loops, best of 3: 2.58 ms per loop
%%timeit
n = len(s)
i = np.arange(n).repeat([len(x) if hasattr(x, '__len__') else 1 for x in s])
j, u = pd.factorize(list(itertools.chain(*s)))
m = u.size
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
pd.DataFrame(np.where(b, u, np.NaN), columns=u)
1000 loops, best of 3: 287 µs per loop
%%timeit
s.apply(pd.Series)\
 .stack().str.get_dummies().sum(level=0)\
 .pipe(lambda x: x.mul(x.columns.values))\
 .replace('', np.nan)\
 .add_prefix('col')
100 loops, best of 3: 4.24 ms per loop

First stack the lists in the col column, get dummies for each element, then turn the ones back into a, b, c. Finally, rename the columns with the col prefix.
df.col.apply(pd.Series)\
  .stack().str.get_dummies().sum(level=0)\
  .pipe(lambda x: x.mul(x.columns.values))\
  .replace('', np.nan)\
  .add_prefix('col')
Out[204]:
cola colb colc
0 a NaN NaN
1 a NaN c
2 NaN b NaN

Assuming you get the column out as a Series called s:
s = pd.Series(['a', ['c', 'a'], 'b'])
pd.DataFrame({"col" + x.upper(): s.apply(lambda n: x if x in n else np.NaN)
for x in ['a', 'b', 'c']})
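On newer pandas (0.25 or later), Series.explode offers a similar route; here is a minimal sketch along the lines of the get_dummies answers above (not one of the original answers):
import pandas as pd

s = pd.Series(['a', ['c', 'a'], 'b'])

# explode list-like rows into one element per row, one-hot encode, then collapse back per original index
d = s.explode().str.get_dummies().groupby(level=0).max()
# reuse the multiply-then-mask trick to turn ones into column names and zeros into NaN
d.mul(d.columns.values).where(d.astype(bool))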

Use dataframe column containing "column name strings", to return values from dataframe based on column name and index without using .apply()

I have a dataframe as follows:
df=pandas.DataFrame()
df['A'] = numpy.random.random(10)
df['B'] = numpy.random.random(10)
df['C'] = numpy.random.random(10)
df['Col_name'] = numpy.random.choice(['A','B','C'],size=10)
I want to obtain an output that uses 'Col_name' and the respective index of the dataframe row to lookup the value in the dataframe.
I can get the desired output with .apply() as follows:
df['output'] = df.apply(lambda x: x[ x['Col_name'] ], axis=1)
.apply() is slow over a large dataframe since it iterates row by row. Is there an obvious solution in pandas that is faster/vectorised?
You can also pick each column name (or give a list of possible names), apply it as a mask to filter your dataframe, then pick values from the desired column and assign them to all rows matching the mask. Then repeat for the next column.
for column_name in df:  # or: for column_name in ['A', 'B', 'C']
    df.loc[df['Col_name'] == column_name, 'output'] = df[column_name]
Rows that do not match any mask will have NaN values.
PS. According to my test with 10,000,000 random rows, the method with .apply() takes 2min 24s to finish, while my method takes only 4.3s.
Use melt to flatten your dataframe and keep the rows where Col_name equals the variable column:
df['output'] = df.melt('Col_name', ignore_index=False).query('Col_name == variable')['value']
print(df)
# Output
A B C Col_name output
0 0.202197 0.430735 0.093551 B 0.430735
1 0.344753 0.979453 0.999160 C 0.999160
2 0.500904 0.778715 0.074786 A 0.500904
3 0.050951 0.317732 0.363027 B 0.317732
4 0.722624 0.026065 0.424639 C 0.424639
5 0.578185 0.626698 0.376692 C 0.376692
6 0.540849 0.805722 0.528886 A 0.540849
7 0.918618 0.869893 0.825991 C 0.825991
8 0.688967 0.203809 0.734467 B 0.203809
9 0.811571 0.010081 0.372657 B 0.010081
Transformation after melt:
>>> df.melt('Col_name', ignore_index=False)
Col_name variable value
0 B A 0.202197
1 C A 0.344753
2 A A 0.500904 # keep
3 B A 0.050951
4 C A 0.722624
5 C A 0.578185
6 A A 0.540849 # keep
7 C A 0.918618
8 B A 0.688967
9 B A 0.811571
0 B B 0.430735 # keep
1 C B 0.979453
2 A B 0.778715
3 B B 0.317732 # keep
4 C B 0.026065
5 C B 0.626698
6 A B 0.805722
7 C B 0.869893
8 B B 0.203809 # keep
9 B B 0.010081 # keep
0 B C 0.093551
1 C C 0.999160 # keep
2 A C 0.074786
3 B C 0.363027
4 C C 0.424639 # keep
5 C C 0.376692 # keep
6 A C 0.528886
7 C C 0.825991 # keep
8 B C 0.734467
9 B C 0.372657
Update
Alternative with set_index and stack for @Rabinzel:
df['output'] = (
    df.set_index('Col_name', append=True).stack()
      .loc[lambda x: x.index.get_level_values(1) == x.index.get_level_values(2)]
      .droplevel([1, 2])
)
print(df)
# Output
A B C Col_name output
0 0.209953 0.332294 0.812476 C 0.812476
1 0.284225 0.566939 0.087084 A 0.284225
2 0.815874 0.185154 0.155454 A 0.815874
3 0.017548 0.733474 0.766972 A 0.017548
4 0.494323 0.433719 0.979399 C 0.979399
5 0.875071 0.789891 0.319870 B 0.789891
6 0.475554 0.229837 0.338032 B 0.229837
7 0.123904 0.397463 0.288614 C 0.288614
8 0.288249 0.631578 0.393521 A 0.288249
9 0.107245 0.006969 0.367748 C 0.367748
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['A'] = np.random.random(10)
df['B'] = np.random.random(10)
df['C'] = np.random.random(10)
df['Col_name'] = np.random.choice(['A','B','C'],size=10)
df["output"] = np.nan
Even though you do not like going row by row, I still routinely use loops to go through each row, just to know where it breaks when it breaks. Here are two loops just to satisfy myself. The output column is created ahead of time with NaN values because the loops need it to exist.
# each row by index
for i in range(len(df)):
    df.loc[i, 'output'] = df[df['Col_name'][i]][i]

# each row, but by column name
for i, col in enumerate(df["Col_name"]):
    df.loc[i, 'output'] = df.loc[i, col]
Here are some "non-loop" ways to do so.
df["output"] = df.lookup(df.index, df.Col_name)
df['output'] = np.where(np.isnan(df['output']), df[df['Col_name']], np.nan)

Pandaic reasoning behind a way to conditionally update new value from other values in same row in DataFrame

What is the pandaic reasoning behind a way to update a new value in a DataFrame based on other values from the same row?
Given
df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
a b
0 1 2
1 3 4
I want
a b c
0 1 2 NaN
1 3 4 3.0
Where the values in column 'c' are set from 'a' if 'b' >= 4.
(1) I tried:
df['c']=df[df['b']>=4]['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(2) I also tried the approach from "How can I conditionally update multiple columns in a pandas dataframe", which sets values from other row values:
df.loc[df['b'] >= 4, 'c'] = df['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(3) jp also showed another way:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
Which of the above is the most pandaic? How does loc work?
Answers to the following did not work:
Update row values where certain condition is met in pandas: sets values from a literal
How to conditionally update DataFrame column in Pandas: sets values from a literal
Another possible way is to use apply:
df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)
Result:
a b c
0 1 2 NaN
1 3 4 3.0
Comparing the timings, np.where seems to perform best here among different methods:
%timeit df.loc[df['b'] >= 4, 'c'] = df['a']
1000 loops, best of 3: 1.54 ms per loop
%timeit df['c']=df[df['b']>=4]['a']
1000 loops, best of 3: 869 µs per loop
%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
1000 loops, best of 3: 440 µs per loop
%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
1000 loops, best of 3: 359 µs per loop
This will not work because df['c'] is not defined and, even if it were, the left-hand side is a dataframe while the right is a series:
df[df['b'] >= 4] = df['c']
You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:
df.loc[df['b'] >= 4, 'c'] = df['a']
This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
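For completeness, Series.where gives the same result in a single line, since it keeps values where the condition holds and fills NaN everywhere else by default (a small sketch, not one of the answers above):
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))

# keep 'a' where b >= 4; NaN is where's default fill value
df['c'] = df['a'].where(df['b'] >= 4)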

How to slice strings in a column by another column in pandas

df=pd.DataFrame({'A':['abcde','fghij','klmno','pqrst'], 'B':[1,2,3,4]})
I want to slice column A by column B, e.g. abcde[:1] = a, klmno[:3] = klm,
but both statements failed:
df['new_column']=df.A.map(lambda x: x.str[:df.B])
df['new_column']=df.apply(lambda x: x.A[:x.B])
TypeError: string indices must be integers
and
df['new_column']=df['A'].str[:df['B']]
new_column returns NaN.
I am trying to get this new_column:
A B new_column
0 abcde 1 a
1 fghij 2 fg
2 klmno 3 klm
3 pqrst 4 pqrs
Thank you so much
You need axis=1 in the apply method to loop through rows:
df['new_column'] = df.apply(lambda r: r.A[:r.B], axis=1)
df
# A B new_column
#0 abcde 1 a
#1 fghij 2 fg
#2 klmno 3 klm
#3 pqrst 4 pqrs
A less idiomatic but usually faster solution is to use zip:
df['new_column'] = [A[:B] for A, B in zip(df.A, df.B)]
df
# A B new_column
#0 abcde 1 a
#1 fghij 2 fg
#2 klmno 3 klm
#3 pqrst 4 pqrs
%timeit df.apply(lambda r: r.A[:r.B], axis=1)
# 1000 loops, best of 3: 440 µs per loop
%timeit [A[:B] for A, B in zip(df.A, df.B)]
# 10000 loops, best of 3: 27.6 µs per loop
By using zip. May this solution be helpful for you.

Apply function for two dataframes in pandas

I have two dataframes:
df0
a b
c 0.3 0.6
d 0.4 NaN
df1
a b
c 3 2
d 0 4
I have a custom function:
def concat(d0, d1):
    if d0 is not None and d1 is not None:
        return '%s,%s' % (d0, d1)
    return None
Result I expect:
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
How could I apply the function to those two dataframes?
Here is a solution.
The idea is first to reduce your dataframes to a flat list of values. This allows you to loop over the values of the two dataframes using zip and apply your function.
Finally, you go back to the original shape using numpy reshape:
new_vals = [concat(d0, d1) for d0, d1 in zip(df0.values.flat, df1.values.flat)]
result = pd.DataFrame(np.reshape(new_vals, (2, 2)), index=['c', 'd'], columns=['a', 'b'])
If it fits your specific application, you can do:
#Concatenate the two as String
df = df0.astype(str) + "," +df1.astype(str)
#Remove the nan
df = df.applymap(lambda x: x if 'nan' not in x else np.nan)
You'll get better performance than using apply.
Output:
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
Use add with applymap and mask:
df = df0.astype(str).add(',').add(df1.astype(str))
df = df.mask(df.applymap(lambda x: 'nan' in x))
print (df)
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
Another solution is to build the strings first and then replace NaN by condition with mask; by default, True values are replaced with NaN:
df = df0.astype(str).add(',').add(df1.astype(str))
m = df0.isnull() | df1.isnull()
print (m)
a b
c False False
d False True
df = df.mask(m)
print (df)
a b
c 0.3,3 0.6,2
d 0.4,0 NaN
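If you want to keep an element-wise helper like the concat function from the question, here is a NaN-aware sketch (not from the answers above; pd.isna replaces the is not None check so the NaN cell propagates, and the sample frames are rebuilt to make it self-contained):
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'a': [0.3, 0.4], 'b': [0.6, np.nan]}, index=['c', 'd'])
df1 = pd.DataFrame({'a': [3, 0], 'b': [2, 4]}, index=['c', 'd'])

def concat(d0, d1):
    # return NaN as soon as either input is missing
    if pd.isna(d0) or pd.isna(d1):
        return np.nan
    return '%s,%s' % (d0, d1)

# apply the helper over pairs of aligned cells, column by column
result = pd.DataFrame({col: [concat(x, y) for x, y in zip(df0[col], df1[col])]
                       for col in df0.columns}, index=df0.index)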

Comparing a dataframe on string lengths for different columns

I am trying to get the string lengths for different columns. Seems quite straightforward with:
df['a'].str.len()
But I need to apply it to multiple columns. And then get the minimum on it.
Something like:
df[['a','b','c']].str.len().min
I know the above doesn't work, but hopefully you get the idea. Columns a, b and c all contain names and I want to retrieve the shortest name.
Also, because of the huge data, I am avoiding creating other columns to save on size.
I think you need a list comprehension, because string functions work only with a Series (a single column):
print ([df[col].str.len().min() for col in ['a','b','c']])
Another solution with apply:
print ([df[col].apply(len).min() for col in ['a','b','c']])
Sample:
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h st fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
Timings:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop
In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
Conclusion:
apply is faster, but does not work with None.
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h None fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].apply(len).min() for col in ['a','b','c']])
TypeError: object of type 'NoneType' has no len()
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
EDIT by comment:
# fails with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0 1
1 0
2 2
dtype: int64
# works with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0 1
1 0
2 2
dtype: int64
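The answers above return minimum lengths per column; if the goal is to retrieve the shortest name itself, one way is to stack the name columns into a single Series first (a sketch, not part of the original answers; stack drops None/NaN entries, so it also works on the second sample frame):
import pandas as pd

df = pd.DataFrame({'a': ['h', 'gg', 'yyy'],
                   'b': ['st', 'dsws', 'sw'],
                   'c': ['fffff', '', 'rr'],
                   'd': [1, 3, 5]})

# flatten the name columns into one Series, then pick the entry with the smallest length
names = df[['a', 'b', 'c']].stack()
shortest = names.loc[names.str.len().idxmin()]   # '' here, the empty string in column c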
