This is perfectly legal in Python:
In [1]: 'abc' + 'def'
Out[1]: 'abcdef'
If I have an all-text pandas DataFrame, like the example below:
In [2]: df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                          columns=['C1','C2','C3','C4'])
        df.loc[[0,2], ['C2', 'C3']] = np.nan
        df
Out[2]: C1 C2 C3 C4
0 a NaN NaN d
1 e f g h
2 i NaN NaN l
Is it possible to do the same with columns of the above DataFrame? Something like:
In [3]: df.apply(+, axis=1) # Or
df.sum(axis=1)
Note that neither of the statements above works. Using .str.cat() in a loop is easy, but I am looking for something better.
Expected output is:
Out[3]: C
0 ad
1 efgh
2 il
You could do
df.fillna('').sum(axis=1)
Of course, this assumes that your dataframe is made up only of strings and NaNs.
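For reference, a minimal end-to-end sketch (assuming the all-string df built in the question), which also names the result column 'C' as in the expected output:
import numpy as np
import pandas as pd
df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                  columns=['C1','C2','C3','C4'])
df.loc[[0,2], ['C2', 'C3']] = np.nan
# replace NaN with '' so that summing across a row is plain string concatenation
result = df.fillna('').sum(axis=1).rename('C')
print (result)
0      ad
1    efgh
2      il
Name: C, dtype: object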
Option 1
stack
I wanted to add this for demonstration. We don't have to accept the rectangular nature of the dataframe; we can use stack instead. When we do, stack drops NaN by default, leaving us with a vector of strings and a pd.MultiIndex. We can group by the first level of this pd.MultiIndex (which used to be the row index) and perform the summation:
df.stack().groupby(level=0).sum()
0 ad
1 efgh
2 il
dtype: object
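To see why groupby(level=0) works, this is roughly what the intermediate stacked Series looks like (a sketch, assuming the df from the question): stack drops the NaN cells and keeps a (row, column) pd.MultiIndex, so level 0 is the original row index:
df.stack()
0  C1    a
   C4    d
1  C1    e
   C2    f
   C3    g
   C4    h
2  C1    i
   C4    l
dtype: object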
Option 2
Use masked arrays: np.ma.masked_array
I was motivated by jezrael to post a faster solution (-:
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)
0 ad
1 efgh
2 il
dtype: object
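As a rough illustration of what the masked-array step does (again assuming the df above): masking the NaN positions and filling them with '' yields a plain 2D object array whose row-wise sum is string concatenation:
masked = np.ma.masked_array(df.values, df.isnull().values)
masked.filled('')
array([['a', '', '', 'd'],
       ['e', 'f', 'g', 'h'],
       ['i', '', '', 'l']], dtype=object)
masked.filled('').sum(1)
array(['ad', 'efgh', 'il'], dtype=object)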
Timing
df = pd.concat([df]*1000).reset_index(drop=True)
%%timeit
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
        fill_value=''
    ).filled('').sum(1),
    df.index
)
1000 loops, best of 3: 860 µs per loop
%timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.33 ms per loop
A slightly faster solution is to convert to a NumPy array with values and then use numpy.sum:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
#print (df)
In [49]: %timeit (df.fillna('').sum(axis=1))
100 loops, best of 3: 4.08 ms per loop
In [50]: %timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))
1000 loops, best of 3: 1.49 ms per loop
In [51]: %timeit (pd.Series(np.sum(df.fillna('').values, axis=1), index=df.index))
1000 loops, best of 3: 1.5 ms per loop
Related: What is the pandaic reasoning behind a way to update a new value in a DataFrame based on other values from the same row?
Given
df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
a b
0 1 2
1 3 4
I want
a b c
0 1 2 NaN
1 3 4 3.0
Where the values in column 'c' are set from 'a' if 'b' >= 4.
(1) I tried:
df['c']=df[df['b']>=4]['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(2) I also tried the approach from "How can I conditionally update multiple columns in a pandas dataframe", which sets values from other row values:
df.loc[df['b'] >= 4, 'c'] = df['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(3) jp also showed another way:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
Which of the above is the most pandaic? How does loc work?
Answers to the following did not work:
Update row values where certain condition is met in pandas: sets values from a literal
How to conditionally update DataFrame column in Pandas: sets values from a literal
Another possible way is to use apply:
df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)
Result:
a b c
0 1 2 NaN
1 3 4 3.0
Comparing the timings, np.where seems to perform best here among different methods:
%timeit df.loc[df['b'] >= 4, 'c'] = df['a']
1000 loops, best of 3: 1.54 ms per loop
%timeit df['c']=df[df['b']>=4]['a']
1000 loops, best of 3: 869 µs per loop
%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
1000 loops, best of 3: 440 µs per loop
%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
1000 loops, best of 3: 359 µs per loop
This will not work because df['c'] is not defined and, even if it were, the left side is a dataframe while the right side is a series:
df[df['b'] >= 4] = df['c']
You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:
df.loc[df['b'] >= 4, 'c'] = df['a']
This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
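For completeness, the same result can also be written with Series.where, which keeps the values where the condition holds and fills NaN elsewhere (a sketch equivalent to the np.where line above):
df['c'] = df['a'].where(df['b'] >= 4)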
df=pd.DataFrame({'A':['abcde','fghij','klmno','pqrst'], 'B':[1,2,3,4]})
I want to slice column A by column B, e.g. abcde[:1] = a, klmno[:3] = klm,
but both statements failed:
df['new_column']=df.A.map(lambda x: x.str[:df.B])
df['new_column']=df.apply(lambda x: x.A[:x.B])
TypeError: string indices must be integers
and
df['new_column']=df['A'].str[:df['B']]
new_column returns NaN.
The new_column I am trying to get is:
A B new_column
0 abcde 1 a
1 fghij 2 fg
2 klmno 3 klm
3 pqrst 4 pqrs
Thank you so much
You need axis=1 in the apply method to loop through rows:
df['new_column'] = df.apply(lambda r: r.A[:r.B], axis=1)
df
# A B new_column
#0 abcde 1 a
#1 fghij 2 fg
#2 klmno 3 klm
#3 pqrst 4 pqrs
A less idiomatic but usually faster solution is to use zip:
df['new_column'] = [A[:B] for A, B in zip(df.A, df.B)]
df
# A B new_column
#0 abcde 1 a
#1 fghij 2 fg
#2 klmno 3 klm
#3 pqrst 4 pqrs
%timeit df.apply(lambda r: r.A[:r.B], axis=1)
# 1000 loops, best of 3: 440 µs per loop
%timeit [A[:B] for A, B in zip(df.A, df.B)]
# 10000 loops, best of 3: 27.6 µs per loop
Using zip works here as well; I hope this solution is helpful for you.
I get a dataframe from an interface with cryptically named columns, of which I know some substrings that are mutually exclusive across all columns.
A simplified example looks like this:
df = pandas.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df
d10432first34sf d10432second34sf
0 1 4
1 2 5
2 3 6
Since I know the column substrings, I can access individual columns in the following way:
df.filter(like='first')
d10432first34sf
0 1
1 2
2 3
df.filter(like='second')
d10432second34sf
0 4
1 5
2 6
But now I also need to get the exact name of each matched column, which is unknown to me. How can I achieve that?
Add .columns:
cols = df.filter(like='first').columns
print (cols)
Index(['d10432first34sf'], dtype='object')
Or, better, boolean indexing with str.contains:
cols = df.columns[df.columns.str.contains('first')]
print (cols)
Index(['d10432first34sf'], dtype='object')
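Since the question asks for the exact column name and each substring matches exactly one column, you can take the single element out of the resulting Index (a sketch, assuming exactly one match per substring):
first_col = df.columns[df.columns.str.contains('first')][0]
print (first_col)
d10432first34sf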
The timings are not the same:
df = pd.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df = pd.concat([df]*10000, axis=1).reset_index(drop=True)
df = pd.concat([df]*1000).reset_index(drop=True)
df.columns = df.columns + pd.Series(range(10000 * 2)).astype('str')
print (df.shape)
(3000, 20000)
In [267]: %timeit df.filter(like='first').columns
10 loops, best of 3: 117 ms per loop
In [268]: %timeit df.columns[df.columns.str.contains('first')]
100 loops, best of 3: 11.9 ms per loop
I am trying to make a piece of my code run quicker.
I have two dataframes of different sizes, A and B. I also have a dictionary of ages called age_dict.
A contains 100 rows and B contains 200 rows. Both use an index starting at 0, and both have two columns, "Name" and "Age".
The dictionary keys are names and their values are ages. All keys are unique; there are no duplicates:
{'John':20,'Max':25,'Jack':30}
I want to find the names in each DataFrame and assign them the age from the dictionary. I achieve this using the following code (I want to return a new DataFrame and not amend the old one):
def age(df):
    new_df = df.copy(deep=True)
    i = 0
    while i < len(new_df['Name']):
        name = new_df['Name'][i]
        age = age_dict[name]
        new_df['Age'][i] = age
        i += 1
    return new_df
new_A = age(A)
new_B = age(B)
This code takes longer than I want it to, so I'm wondering if pandas has an easier way to do this instead of me looping through each row?
Thank you!
I think you need map:
A = pd.DataFrame({'Name':['John','Max','Joe']})
print (A)
Name
0 John
1 Max
2 Joe
d = {'John':20,'Max':25,'Jack':30}
A1 = A.copy(deep=True)
A1['Age'] = A.Name.map(d)
print (A1)
Name Age
0 John 20.0
1 Max 25.0
2 Joe NaN
If you need a function:
d = {'John':20,'Max':25,'Jack':30}
def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df
new_A = age(A)
print (new_A)
Name Age
0 John 20.0
1 Max 25.0
2 Joe NaN
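If unmatched names such as 'Joe' should get a default age instead of NaN, one possible variation (the default value 0 is an assumption, not part of the question):
new_A['Age'] = new_A.Name.map(d).fillna(0).astype(int)
print (new_A)
   Name  Age
0  John   20
1   Max   25
2   Joe    0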
Timings:
In [191]: %timeit (age(A))
10 loops, best of 3: 21.8 ms per loop
In [192]: %timeit (jul(A))
10 loops, best of 3: 47.6 ms per loop
Code for timings:
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300000 rows x 2 columns]
A = pd.concat([A]*100000).reset_index(drop=True)
print (A)
d = {'John':20,'Max':25,'Jack':30}
def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df

def jul(A):
    df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
    A1 = pd.merge(A, df, how='left')
    return A1
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300 rows x 2 columns]
A = pd.concat([A]*100).reset_index(drop=True)
In [194]: %timeit (age(A))
The slowest run took 5.22 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 742 µs per loop
In [195]: %timeit (jul(A))
The slowest run took 4.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.87 ms per loop
You can create another dataframe from your dict and merge the two dataframes on a common key:
d = {'John':20,'Max':25,'Jack':30}
A = pd.DataFrame({'Name':['John','Max','Joe']})
df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
A1 = pd.merge(A, df, how='left')
#    Name   Age
# 0  John  20.0
# 1   Max  25.0
# 2   Joe   NaN
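For readability you can also spell out the join key explicitly; this is the same merge with the key named (a sketch):
A1 = pd.merge(A, df, on='Name', how='left')
Because 'Joe' has no match in the dict-based dataframe, the Age column becomes float and Joe gets NaN, exactly as in the output above.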
I am trying to get the string lengths for different columns. Seems quite straightforward with:
df['a'].str.len()
But I need to apply it to multiple columns and then get the minimum.
Something like:
df[['a','b','c']].str.len().min
I know the above doesn't work, but hopefully you get the idea. Columns a, b, and c all contain names, and I want to retrieve the shortest name.
Also, because the data is huge, I am avoiding creating extra columns to save on size.
I think you need a list comprehension, because the string functions work only with a Series (a single column):
print ([df[col].str.len().min() for col in ['a','b','c']])
Another solution with apply:
print ([df[col].apply(len).min() for col in ['a','b','c']])
Sample:
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h st fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
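If the goal is the shortest name itself rather than its length, one possible sketch (not part of the answer above; it stacks the three columns into a single Series and looks up the entry with the smallest length):
s = df[['a','b','c']].stack()
shortest = s.loc[s.str.len().idxmin()]
# for the sample df above this returns the empty string in column 'c', whose length is 0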
Timings:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop
In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
Conclusion:
apply is faster, but it does not work with None.
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h None fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].apply(len).min() for col in ['a','b','c']])
TypeError: object of type 'NoneType' has no len()
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
EDIT by comment:
#fails with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0 1
1 0
2 2
dtype: int64
#works with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0 1
1 0
2 2
dtype: int64
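For the None case, a possible elementwise workaround is to make the length function None-aware before taking the row-wise minimum (a sketch; using np.nan as the stand-in for a missing length is an assumption):
print (df[['a','b','c']].applymap(lambda x: len(x) if x is not None else np.nan).min(axis=1))
0    1.0
1    0.0
2    2.0
dtype: float64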