How to slice strings in a column by another column in pandas - python

df=pd.DataFrame({'A':['abcde','fghij','klmno','pqrst'], 'B':[1,2,3,4]})
I want to slice column A by column B, e.g. abcde[:1] = a, klmno[:3] = klm,
but both of these statements failed:
df['new_column']=df.A.map(lambda x: x.str[:df.B])
df['new_column']=df.apply(lambda x: x.A[:x.B])
TypeError: string indices must be integers
and
df['new_column']=df['A'].str[:df['B']]
which fills new_column with NaN.
The new_column I am trying to get:
A B new_column
0 abcde 1 a
1 fghij 2 fg
2 klmno 3 klm
3 pqrst 4 pqrs
Thank you so much

You need axis=1 in the apply method to loop through rows:
df['new_column'] = df.apply(lambda r: r.A[:r.B], axis=1)
df
# A B new_column
#0 abcde 1 a
#1 fghij 2 fg
#2 klmno 3 klm
#3 pqrst 4 pqrs
A less idiomatic but usually faster solution is to use zip:
df['new_column'] = [A[:B] for A, B in zip(df.A, df.B)]
df
# A B new_column
#0 abcde 1 a
#1 fghij 2 fg
#2 klmno 3 klm
#3 pqrst 4 pqrs
%timeit df.apply(lambda r: r.A[:r.B], axis=1)
# 1000 loops, best of 3: 440 µs per loop
%timeit [A[:B] for A, B in zip(df.A, df.B)]
# 10000 loops, best of 3: 27.6 µs per loop

By using zip. Hopefully this solution is helpful for you.
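Presumably this refers to the same zip pattern shown above; a minimal sketch:
# Build the sliced strings row by row with zip (same idea as the list comprehension above)
df['new_column'] = [a[:b] for a, b in zip(df['A'], df['B'])]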

Related

Pandaic reasoning behind a way to conditionally update new value from other values in same row in DataFrame

What is the pandaic reasoning behind a way to update a new value in a DataFrame based on other values from the same row?
Given
df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
a b
0 1 2
1 3 4
I want
a b c
0 1 2 NaN
1 3 4 3.0
Where the values in column 'c' are set from 'a' if 'b' >= 4.
(1) I tried:
df['c']=df[df['b']>=4]['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(2) I also tried the approach from "How can I conditionally update multiple columns in a panda dataframe", which sets values from other row values:
df.loc[df['b'] >= 4, 'c'] = df['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(3) jp also showed another way:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
Which of the above is the most pandaic? How does loc work?
Answers to the following did not work:
Update row values where certain condition is met in pandas: sets values from a literal
How to conditionally update DataFrame column in Pandas: sets values from a literal
Another possible way is to use apply:
df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)
Result:
a b c
0 1 2 NaN
1 3 4 3.0
Comparing the timings, np.where seems to perform best here among different methods:
%timeit df.loc[df['b'] >= 4, 'c'] = df['a']
1000 loops, best of 3: 1.54 ms per loop
%timeit df['c']=df[df['b']>=4]['a']
1000 loops, best of 3: 869 µs per loop
%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
1000 loops, best of 3: 440 µs per loop
%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
1000 loops, best of 3: 359 µs per loop
This will not work because df['c'] is not defined and, even if it were, the left side is a DataFrame while the right side is a Series:
df[df['b'] >= 4] = df['c']
You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:
df.loc[df['b'] >= 4, 'c'] = df['a']
This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
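To see why the working version lines up, a minimal sketch using the sample frame from the question (the name mask is just for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))

mask = df['b'] >= 4          # boolean Series: [False, True]
df.loc[mask, 'c'] = df['a']  # both sides are Series; values align on the row index
# Row 0 is not selected, so the newly created column 'c' stays NaN there.

# The numpy.where equivalent builds the whole column at once:
df['c'] = np.where(mask, df['a'], np.nan)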

Pandas string addition across columns

This is perfectly legal in Python:
In [1]: 'abc' + 'def'
Out[1]: 'abcdef'
If I have an all text Pandas DataFrame, like the example below:
In [2]: df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                          columns=['C1','C2','C3','C4'])
df.loc[[0,2], ['C2', 'C3']] = np.nan
df
Out[2]: C1 C2 C3 C4
0 a NaN NaN d
1 e f g h
2 i NaN NaN l
Is it possible to do the same with columns of the above DataFrame? Something like:
In [3]: df.apply(+, axis=1) # Or
df.sum(axis=1)
Note that neither of the statements above works. Using .str.cat() in a loop is easy, but I am looking for something better.
Expected output is:
Out[3]: C
0 ad
1 efgh
2 il
You could do
df.fillna('').sum(axis=1)
Of course, this assumes that your dataframe is made up only of strings and NaNs.
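As an aside, the .str.cat() route mentioned in the question can also be written without an explicit loop; a minimal sketch, assuming every column holds strings or NaN:
# Concatenate C1..C4 element-wise; na_rep='' treats missing values as empty strings
df['C1'].str.cat([df['C2'], df['C3'], df['C4']], na_rep='')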
Option 1
stack
I wanted to add this for demonstration. We don't have to accept the rectangular nature of the dataframe; we can use stack. When we do, stack drops NaN by default, leaving us with a vector of strings indexed by a pd.MultiIndex. We can group by the first level of this pd.MultiIndex (which used to be the row index) and perform summation:
df.stack().groupby(level=0).sum()
0 ad
1 efgh
2 il
dtype: object
Option 2
Use Masked Arrays np.ma.masked_array
I was motivated by jezrael to post a faster solution (-:
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)
0 ad
1 efgh
2 il
dtype: object
Timing
df = pd.concat([df]*1000).reset_index(drop=True)
%%timeit
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
        fill_value=''
    ).filled('').sum(1),
    df.index
)
1000 loops, best of 3: 860 µs per loop
%timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.33 ms per loop
A slightly faster solution is to convert to a numpy array with .values and then use numpy.sum:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
#print (df)
In [49]: %timeit (df.fillna('').sum(axis=1))
100 loops, best of 3: 4.08 ms per loop
In [50]: %timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.49 ms per loop
In [51]: %timeit (pd.Series(np.sum(df.fillna('').values, axis=1), index=df.index))
1000 loops, best of 3: 1.5 ms per loop

Splitting Pandas column into multiple columns

Is there a way in Pandas to split a column into multiple columns?
I have a column in a dataframe where the contents are as follows:
a
[c,a]
b
I would like to split this into:
colA colB colC
a nan nan
a nan c
nan b nan
Please note the order of variables in the 2nd row in the original column.
Thanks
Consider the series s
s = pd.Series(['a', ['c', 'a'], 'b'])
s
0 a
1 [c, a]
2 b
dtype: object
Use pd.Series and '|'.join to turn each element into a pipe-separated string. Use str.get_dummies to get an array of zeros and ones. Multiply that by the column names to replace the ones with column values; where then masks the zeros and replaces them with np.NaN.
d1 = s.apply(lambda x: '|'.join(pd.Series(x))).str.get_dummies()
d1.mul(d1.columns.values).where(d1.astype(bool))
a b c
0 a NaN NaN
1 a NaN c
2 NaN b NaN
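To make that chain easier to follow, a sketch of the intermediate objects for the sample series:
# joined: each element collapsed to a pipe-separated string ('a', 'c|a', 'b')
joined = s.apply(lambda x: '|'.join(pd.Series(x)))
# d1: 0/1 indicator columns (a, b, c) produced by splitting each string on '|'
d1 = joined.str.get_dummies()
# multiplying by the column names writes the letter where the indicator is 1;
# where(...) then masks the zeros (empty strings) to NaN
d1.mul(d1.columns.values).where(d1.astype(bool))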
PROJECT/KILL
import itertools
n = len(s)
# repeat each row index once per item it holds (lists contribute len(x) slots)
i = np.arange(n).repeat([len(x) if hasattr(x, '__len__') else 1 for x in s])
# factorize the flattened values: j = integer codes, u = unique values
j, u = pd.factorize(list(itertools.chain(*s)))
m = u.size
# count each (row, value) pair into an n x m indicator matrix
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# where the indicator is nonzero, put the value; elsewhere NaN
pd.DataFrame(np.where(b, u, np.NaN), columns=u)
a b c
0 a NaN NaN
1 a NaN c
2 NaN b NaN
Timing
%%timeit
d1 = s.apply(lambda x: '|'.join(pd.Series(x))).str.get_dummies()
d1.mul(d1.columns.values).where(d1.astype(bool))
100 loops, best of 3: 2.58 ms per loop
%%timeit
n = len(s)
i = np.arange(n).repeat([len(x) if hasattr(x, '__len__') else 1 for x in s])
j, u = pd.factorize(list(itertools.chain(*s)))
m = u.size
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
pd.DataFrame(np.where(b, u, np.NaN), columns=u)
1000 loops, best of 3: 287 µs per loop
%%timeit
s.apply(pd.Series)\
.stack().str.get_dummies().sum(level=0)\
.pipe(lambda x: x.mul(x.columns.values))\
.replace('',np.nan)\
.add_prefix('col')
100 loops, best of 3: 4.24 ms per loop
First expand and stack the lists in the col column, get dummies for each element, and then transform them back to the letters a, b, c. Finally prefix the column names with col.
df.col.apply(pd.Series)\
.stack().str.get_dummies().sum(level=0)\
.pipe(lambda x: x.mul(x.columns.values))\
.replace('',np.nan)\
.add_prefix('col')
Out[204]:
cola colb colc
0 a NaN NaN
1 a NaN c
2 NaN b NaN
Assuming you get the column out as a series called s.
s = pd.Series(['a', ['c', 'a'], 'b'])
pd.DataFrame({"col" + x.upper(): s.apply(lambda n: x if x in n else np.NaN)
for x in ['a', 'b', 'c']})

Get column names of specific columns

I get a dataframe from an interface with cryptically named columns, of which I know some substrings that are mutually exclusive across all columns.
A simplified example looks like this:
df = pandas.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df
d10432first34sf d10432second34sf
0 1 4
1 2 5
2 3 6
Since I know the column substrings, I can access individual columns in the following way:
df.filter(like='first')
d10432first34sf
0 1
1 2
2 3
df.filter(like='second')
d10432second34sf
0 4
1 5
2 6
But now I also need to get the exact name of each column, which is unknown to me. How can I achieve that?
Add .columns:
cols = df.filter(like='first').columns
print (cols)
Index(['d10432first34sf'], dtype='object')
Or better, boolean indexing with str.contains:
cols = df.columns[df.columns.str.contains('first')]
print (cols)
Index(['d10432first34sf'], dtype='object')
The timings are not the same:
df = pd.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df = pd.concat([df]*10000, axis=1).reset_index(drop=True)
df = pd.concat([df]*1000).reset_index(drop=True)
df.columns = df.columns + pd.Series(range(10000 * 2)).astype('str')
print (df.shape)
(3000, 20000)
In [267]: %timeit df.filter(like='first').columns
10 loops, best of 3: 117 ms per loop
In [268]: %timeit df.columns[df.columns.str.contains('first')]
100 loops, best of 3: 11.9 ms per loop
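If a plain Python list of names is more convenient than an Index, .tolist() converts either result:
# Convert the matching column names to a plain list (works for both approaches above)
cols = df.columns[df.columns.str.contains('first')].tolist()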

Comparing a dataframe on string lengths for different columns

I am trying to get the string lengths for different columns. Seems quite straightforward with:
df['a'].str.len()
But I need to apply it to multiple columns and then get the minimum of it.
Something like:
df[['a','b','c']].str.len().min
I know the above doesn't work, but hopefully you get the idea. Column a, b, c all contain names and I want to retrieve the shortest name.
Also because of huge data, I am avoiding creating other columns to save on size.
I think you need a list comprehension, because string functions work only on a Series (a single column):
print ([df[col].str.len().min() for col in ['a','b','c']])
Another solution with apply:
print ([df[col].apply(len).min() for col in ['a','b','c']])
Sample:
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h st fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
Timings:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop
In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
Conclusion:
apply is faster, but does not work with None.
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h None fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].apply(len).min() for col in ['a','b','c']])
TypeError: object of type 'NoneType' has no len()
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
EDIT by comment:
#fail with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0 1
1 0
2 2
dtype: int64
#working with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0 1
1 0
2 2
dtype: int64
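If the end goal from the question is the single shortest name across columns a, b and c, a minimal sketch building on the per-column minima above:
# Overall shortest string length across the three columns (NaN-safe via str.len())
shortest_len = min(df[col].str.len().min() for col in ['a', 'b', 'c'])

# The shortest name itself: stack the columns into one Series (missing values are dropped)
# and pick the entry whose length is minimal
stacked = df[['a', 'b', 'c']].stack()
shortest_name = stacked[stacked.str.len().idxmin()]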
