I get a dataframe from an interface with cryptically named columns, of which I know some substrings that are mutually exclusive over all columns.
A simplified example looks like this:
df = pandas.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df
d10432first34sf d10432second34sf
0 1 4
1 2 5
2 3 6
Since I know the column substrings, I can access individual columns in the following way:
df.filter(like='first')
d10432first34sf
0 1
1 2
2 3
df.filter(like='second')
d10432second34sf
0 4
1 5
2 6
But now I also need to get the exact name of each column, and those names are unknown to me. How can I achieve that?
Add .columns:
cols = df.filter(like='first').columns
print (cols)
Index(['d10432first34sf'], dtype='object')
Or, better, boolean indexing of the columns with str.contains:
cols = df.columns[df.columns.str.contains('first')]
print (cols)
Index(['d10432first34sf'], dtype='object')
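If exactly one column matches each substring (which the mutual exclusivity implies), the plain string can be pulled out of the resulting Index; a minimal sketch, not part of the original answer:
col_name = cols[0]
print (col_name)
d10432first34sf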
The timings are not the same:
df = pd.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
df = pd.concat([df]*10000, axis=1).reset_index(drop=True)
df = pd.concat([df]*1000).reset_index(drop=True)
df.columns = df.columns + pd.Series(range(10000 * 2)).astype('str')
print (df.shape)
(3000, 20000)
In [267]: %timeit df.filter(like='first').columns
10 loops, best of 3: 117 ms per loop
In [268]: %timeit df.columns[df.columns.str.contains('first')]
100 loops, best of 3: 11.9 ms per loop
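Going back to the small example frame, the same idea extends to building a lookup from each known substring to its actual column name; a hypothetical sketch assuming exactly one match per substring:
import pandas as pd

df = pd.DataFrame({'d10432first34sf':[1,2,3],'d10432second34sf':[4,5,6]})
substrings = ['first', 'second']
#one matching column per substring is assumed here
name_map = {s: df.columns[df.columns.str.contains(s)][0] for s in substrings}
print (name_map)
{'first': 'd10432first34sf', 'second': 'd10432second34sf'}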
What is the pandas-idiomatic way to set a new value in a DataFrame based on other values from the same row?
Given
df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
a b
0 1 2
1 3 4
I want
a b c
0 1 2 NaN
1 3 4 3.0
Where the values in column 'c' are set from 'a' if 'b' >= 4.
(1) I tried:
df['c']=df[df['b']>=4]['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(2) I also tried the approach from "How can I conditionally update multiple columns in a pandas DataFrame", which sets values from other row values:
df.loc[df['b'] >= 4, 'c'] = df['a']
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
(3) jp also showed another way:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
a b c
0 1 2 NaN
1 3 4 3.0
which worked.
Which of the above is the most idiomatic pandas? And how does loc work here?
Answers to the following did not work:
Update row values where certain condition is met in pandas: sets values from a literal
How to conditionally update DataFrame column in Pandas: sets values from a literal
Another possible way is to use apply:
df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
print(df)
Result:
a b c
0 1 2 NaN
1 3 4 3.0
Comparing the timings, np.where seems to perform best here among different methods:
%timeit df.loc[df['b'] >= 4, 'c'] = df['a']
1000 loops, best of 3: 1.54 ms per loop
%timeit df['c']=df[df['b']>=4]['a']
1000 loops, best of 3: 869 µs per loop
%timeit df['c'] = df.apply(lambda row: row['a'] if row['b'] >=4 else None, axis=1)
1000 loops, best of 3: 440 µs per loop
%timeit df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
1000 loops, best of 3: 359 µs per loop
This will not work because df['c'] is not defined and, even if it were, the left side is a DataFrame while the right side is a Series:
df[df['b'] >= 4] = df['c']
You cannot assign a series to a dataframe and your assignment is in the wrong direction, so this will never work. However, as you found, the following works:
df.loc[df['b'] >= 4, 'c'] = df['a']
This is because the left and right of this assignment are both series. As an alternative, you can use numpy.where, which you may find more explicit:
df['c'] = np.where(df['b'] >= 4, df['a'], np.nan)
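For completeness (this is an addition, not part of the original answer), the same result can also be written with Series.where on the same df, which keeps values from 'a' where the condition holds and fills NaN elsewhere:
df['c'] = df['a'].where(df['b'] >= 4)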
I am trying to make a piece of my code run quicker.
I have two dataframes of different sizes, A and B. I also have a dictionary of ages called age_dict.
A contains 100 rows and B contains 200 rows. They both use an index starting at 0, and they both have two columns, "Name" and "Age".
The dictionary keys are names and their values are ages. All keys are unique; there are no duplicates:
{'John':20,'Max':25,'Jack':30}
I want to find the names in each DataFrame and assign them the age from the dictionary. I achieve this using the following code (I want to return a new DataFrame and not amend the old one):
def age(df):
    new_df = df.copy(deep=True)
    i = 0
    while i < len(new_df['Name']):
        name = new_df['Name'][i]
        age = age_dict[name]
        new_df['Age'][i] = age
        i += 1
    return new_df
new_A = age(A)
new_B = age(B)
This code takes longer than I want it to, so I'm wondering if pandas has an easier way to do this instead of me looping through each row?
Thank you!
I think you need map:
A = pd.DataFrame({'Name':['John','Max','Joe']})
print (A)
Name
0 John
1 Max
2 Joe
d = {'John':20,'Max':25,'Jack':30}
A1 = A.copy(deep=True)
A1['Age'] = A.Name.map(d)
print (A1)
Name Age
0 John 20.0
1 Max 25.0
2 Joe NaN
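As an aside (an assumption about what you may want, not part of the original answer), if names missing from the dictionary, like 'Joe' here, should get a default age instead of NaN, map can be chained with fillna; the default of 0 below is purely hypothetical, reusing A1, A and d from above:
A1['Age'] = A.Name.map(d).fillna(0).astype(int)
print (A1)
Name Age
0 John 20
1 Max 25
2 Joe 0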
If you need a function:
d = {'John':20,'Max':25,'Jack':30}
def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df
new_A = age(A)
print (new_A)
Name Age
0 John 20.0
1 Max 25.0
2 Joe NaN
Timings:
In [191]: %timeit (age(A))
10 loops, best of 3: 21.8 ms per loop
In [192]: %timeit (jul(A))
10 loops, best of 3: 47.6 ms per loop
Code for timings:
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300000 rows x 2 columns]
A = pd.concat([A]*100000).reset_index(drop=True)
print (A)
d = {'John':20,'Max':25,'Jack':30}
def age(df):
    new_df = df.copy(deep=True)
    new_df['Age'] = new_df.Name.map(d)
    return new_df

def jul(A):
    df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
    A1 = pd.merge(A, df, how='left')
    return A1
A = pd.DataFrame({'Name':['John','Max','Joe']})
#[300 rows x 2 columns]
A = pd.concat([A]*100).reset_index(drop=True)
In [194]: %timeit (age(A))
The slowest run took 5.22 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 742 µs per loop
In [195]: %timeit (jul(A))
The slowest run took 4.51 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.87 ms per loop
You can create another dataframe from your dict and merge the two dataframes on a common key:
d = {'John':20,'Max':25,'Jack':30}
A = pd.DataFrame({'Name':['John','Max','Joe']})
df = pd.DataFrame({'Name': list(d.keys()), 'Age': list(d.values())})
A1 = pd.merge(A, df, how='left')
# Name Age
# 0 John 20
# 1 Max 25
# 2 Joe NaN
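The lookup frame can also be built directly from the dict items; just an alternative construction with the same result (not in the original answer), reusing d and A from the snippet above:
df = pd.DataFrame(list(d.items()), columns=['Name', 'Age'])
A1 = pd.merge(A, df, how='left')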
I am trying to get the string lengths for different columns. Seems quite straightforward with:
df['a'].str.len()
But I need to apply it to multiple columns and then take the minimum.
Something like:
df[['a','b','c']].str.len().min
I know the above doesn't work, but hopefully you get the idea. Columns a, b and c all contain names, and I want to retrieve the shortest name.
Also, because the data is huge, I am avoiding creating extra columns to save memory.
I think you need a list comprehension, because the string methods work only on a Series (a single column):
print ([df[col].str.len().min() for col in ['a','b','c']])
Another solution with apply:
print ([df[col].apply(len).min() for col in ['a','b','c']])
Sample:
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':['st','dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h st fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2, 0]
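If a single number is wanted, the length of the shortest entry across all three columns, the per-column minima can be reduced with min; a small sketch (an addition, not in the original answer):
print (min(df[col].str.len().min() for col in ['a','b','c']))
0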
Timings:
#[3000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
In [17]: %timeit ([df[col].apply(len).min() for col in ['a','b','c']])
100 loops, best of 3: 2.63 ms per loop
In [18]: %timeit ([df[col].str.len().min() for col in ['a','b','c']])
The slowest run took 4.12 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 2.88 ms per loop
Conclusion:
apply is faster, but it does not work with None.
df = pd.DataFrame({'a':['h','gg','yyy'],
                   'b':[None,'dsws','sw'],
                   'c':['fffff','','rr'],
                   'd':[1,3,5]})
print (df)
a b c d
0 h None fffff 1
1 gg dsws 3
2 yyy sw rr 5
print ([df[col].apply(len).min() for col in ['a','b','c']])
TypeError: object of type 'NoneType' has no len()
print ([df[col].str.len().min() for col in ['a','b','c']])
[1, 2.0, 0]
EDIT by comment:
#fails with None
print (df[['a','b','c']].applymap(len).min(axis=1))
0 1
1 0
2 2
dtype: int64
#works with None
print (df[['a','b','c']].apply(lambda x: x.str.len().min(), axis=1))
0 1
1 0
2 2
dtype: int64
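If the goal is the shortest name itself rather than its length (an assumption based on the question), one possible sketch stacks the three columns into one Series, which drops the None values, and then uses idxmin on the lengths:
s = df[['a','b','c']].stack()
#shortest entry is the empty string in column 'c' for this sample
print (s.loc[s.str.len().idxmin()])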
I need to drop all rows where one column is below a certain value. I used the command below, but it returns the column as an object dtype. I need to keep it as int64:
df["customer_id"] = df.drop(df["customer_id"][df["customer_id"] < 9999999].index)
df = df.dropna()
I have tried to re-cast the field as int64 afterwards, but this raises the following error, referring to data from a totally different column:
invalid literal for long() with base 10: '2014/03/09 11:12:27'
I think you need boolean indexing with reset_index:
import pandas as pd
df = pd.DataFrame({'a': ['s', 'd', 'f', 'g'],
                   'customer_id':[99999990, 99999997, 1000, 8888]})
print (df)
a customer_id
0 s 99999990
1 d 99999997
2 f 1000
3 g 8888
df1 = df[df["customer_id"] > 9999999].reset_index(drop=True)
print (df1)
a customer_id
0 s 99999990
1 d 99999997
A solution with drop, but it is slower:
df2 = (df.drop(df.loc[df["customer_id"] < 9999999, 'customer_id'].index))
print (df2)
a customer_id
0 s 99999990
1 d 99999997
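In both cases the column keeps its integer dtype, which was the original concern; a quick check (not in the original answer), using df1 and df2 from above:
print (df1['customer_id'].dtype)
int64
print (df2['customer_id'].dtype)
int64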
Timings:
In [12]: %timeit df[df["customer_id"] > 9999999].reset_index(drop=True)
1000 loops, best of 3: 676 µs per loop
In [13]: %timeit (df.drop(df.loc[df["customer_id"] < 9999999, 'customer_id'].index))
1000 loops, best of 3: 921 µs per loop
What's wrong with slicing the whole frame (and reindexing if necessary)?
df = df[df["customer_id"] < 9999999]
df.index = range(0,len(df))
I have two pandas DataFrames. I would like to add the rows of one dataframe as columns in the other. I've tried reading through the "Merge, join, and concatenate" documentation, but can't get my head around how to do this in pandas.
Here's how I've managed to do it by converting to numpy arrays, but surely there is a smarter way to do this in pandas.
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2),index=[1,2,3,4],columns=['a','b'])
df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4),index=['c','d'],columns=[5,6,7,8])
ar = np.concatenate((df1.values,df2.values.T),axis=1)
df = pd.DataFrame(ar,columns=['a','b','c','d'],index=[1,2,3,4])
If df1.index has no duplicate values, then you could use df1.join:
In [283]: df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2),index=[1,2,3,4],columns=['a','b'])
In [284]: df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4),index=['c','d'],columns=[5,6,7,8])
In [285]: df1.join(df2.T.set_index(df1.index))
Out[285]:
a b c d
1 -1.196281 0.222283 1.247750 -0.121309
2 1.188098 0.384871 -1.324419 -1.610255
3 -0.928642 -0.618491 0.171215 -1.545479
4 -0.832756 -0.491364 0.100428 -0.525689
If df1 has duplicate entries in its index, then df1.join(...) may return more rows than desired. For example, if df1 has non-unique index [1,2,1,4] then:
In [4]: df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2),index=[1,2,1,4],columns=['a','b'])
In [5]: df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4),index=['c','d'],columns=[5,6,7,8])
In [8]: df1.join(df2.T.set_index(df1.index))
Out[8]:
a b c d
1 -1.087152 -0.828800 -1.129768 -0.579428
1 -1.087152 -0.828800 0.320756 0.297736
1 0.198297 0.277456 -1.129768 -0.579428
1 0.198297 0.277456 0.320756 0.297736
2 1.529188 1.023568 -0.670853 -0.466754
4 -0.393748 0.976632 0.455129 1.230298
The 2 rows with index 1 in df1 are being joined to the 2 rows with index 1 in df2, resulting in 4 rows with index 1, which is probably not what you want.
So, if df1.index does contain duplicate values, use pd.concat to guarantee a simple side-by-side juxtaposition of the two DataFrames:
In [7]: pd.concat([df1, df2.T.set_index(df1.index)], axis=1)
Out[7]:
a b c d
1 -1.087152 -0.828800 -1.129768 -0.579428
2 1.529188 1.023568 -0.670853 -0.466754
1 0.198297 0.277456 0.320756 0.297736
4 -0.393748 0.976632 0.455129 1.230298
One reason you might want to use df1.join, however, is that if you know df1.index has no duplicate values, using it is faster than pd.concat:
In [13]: df1 = pd.DataFrame(np.random.normal(size=8000).reshape(-1,2), columns=['a','b'])
In [14]: df2 = pd.DataFrame(np.random.normal(size=8000).reshape(2,-1),index=['c','d'])
In [15]: %timeit df1.join(df2.T.set_index(df1.index))
1000 loops, best of 3: 600 µs per loop
In [16]: %timeit pd.concat([df1, df2.T.set_index(df1.index)], axis=1)
1000 loops, best of 3: 1.18 ms per loop
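If this pattern comes up often, it can be wrapped in a small helper; a hypothetical sketch (the function name and the positional-alignment assumption are mine, not from the original answers):
import numpy as np
import pandas as pd

def add_rows_as_columns(left, right):
    #transpose right so its rows become columns, then align it on left's index
    #positionally (assumes right has as many columns as left has rows)
    right_t = right.T
    right_t.index = left.index
    return pd.concat([left, right_t], axis=1)

df1 = pd.DataFrame(np.random.normal(size=8).reshape(4,2), index=[1,2,3,4], columns=['a','b'])
df2 = pd.DataFrame(np.random.normal(size=8).reshape(2,4), index=['c','d'], columns=[5,6,7,8])
print (add_rows_as_columns(df1, df2))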