How can I square each element of a column/series of a DataFrame in pandas (and create another column to hold the result)?
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2],[3,4]], columns=list('ab'))
>>> df
   a  b
0  1  2
1  3  4
>>> df['c'] = df['b']**2
>>> df
   a  b   c
0  1  2   4
1  3  4  16
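If the transformation is ever more involved than squaring, the same assignment pattern also works with Series.apply; a quick sketch (note that apply is slower than the vectorized ** for large frames):
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))

# element-wise function applied to one column, result stored in a new column
df['c'] = df['b'].apply(lambda x: x ** 2)
print(df)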
Nothing wrong with the accepted answer; there is also:
df = pd.DataFrame({'a': range(0,100)})
np.square(df)
np.power(df, 2)
These are ever so slightly faster:
In [11]: %timeit df ** 2
10000 loops, best of 3: 95.9 µs per loop
In [13]: %timeit np.square(df)
10000 loops, best of 3: 85 µs per loop
In [15]: %timeit np.power(df, 2)
10000 loops, best of 3: 85.6 µs per loop
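For the original question of storing the result in a new column, np.square can be assigned in exactly the same way as the ** version; a small sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('ab'))

# a NumPy ufunc applied to a Series returns a Series of the same length,
# so it can be assigned directly to a new column
df['c'] = np.square(df['b'])
print(df)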
You can also use the pandas Series.pow() method.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2], [3,4]], columns=list('ab'))
>>> df
   a  b
0  1  2
1  3  4
>>> df['c'] = df['b'].pow(2)
>>> df
   a  b   c
0  1  2   4
1  3  4  16
What's the most pythonic way to add a column (of weights) to an existing Pandas DataFrame "df" based on a condition on one of df's columns?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
   A  B
0  1  4
1  2  5
2  3  6
I'd like to add a "weight" column where df['weight'] = 20 if df['B'] >= 6, and df['weight'] = 1 otherwise.
So my output will be:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Approach #1
Here's one with type-conversion and scaling -
df['weight'] = (df['B'] >= 6)*19+1
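This works because booleans behave as 0/1 in arithmetic, so the comparison result is scaled straight onto the two target values; a quick check:
# True/False act as 1/0, so the comparison maps onto {20, 1} directly
print(True * 19 + 1)   # 20  -> rows where B >= 6
print(False * 19 + 1)  # 1   -> all other rows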
Approach #2
Another, possibly faster, one using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multiple cores with the numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as mentioned by the OP in the comments), with random data in the range [0,9), for the vectorized methods posted so far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one ...
With the output being only 1 or 20, we can safely use a lower-precision dtype, uint8, for a further speedup over the versions already discussed, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
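The gain comes from keeping the intermediate array in a small dtype instead of letting it be upcast to the platform's default integer; a rough dtype check (promotion details can vary slightly between NumPy versions and platforms):
import numpy as np

mask = np.array([True, False, True])
print((mask * 19 + 1).dtype)             # int64 on most platforms
print((mask * np.uint8(19) + 1).dtype)   # uint8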
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
I have a problem getting the rolling function of Pandas to do what I wish. For each row, I want to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
   id  value
0   1      3
1   1      6
2   1      3
3   2      2
4   2      1
Now I wish to obtain the following DataFrame:
   id  value
0   1      3
1   1      6
2   1      6
3   2      2
4   2      2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with NaNs. Can someone explain how to properly use rolling, or some other Pandas function, to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
   id  value  new
0   1      3    3
1   1      6    6
2   1      3    6
3   2      2    2
4   2      1    2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from the Pandas 0.20.0 "what's new" notes:
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
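If you specifically want the window-function route the question started from, the growing-window analogue of rolling is expanding; here is a sketch of the equivalent (cummax remains the simpler and faster choice):
import pandas as pd

df = pd.DataFrame([[1, 3], [1, 6], [1, 3], [2, 2], [2, 1]], columns=['id', 'value'])

# expanding() is a window that grows from the start of each group, so .max()
# is the running maximum; drop the added 'id' index level to align back to df
# (note the result comes back as float)
df['new'] = (df.groupby('id')['value']
               .expanding().max()
               .reset_index(level=0, drop=True))
print(df)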
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop
I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnan() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index into the table, or whether this is an efficient way of performing the job at all.
Use pd.isnull; for selection use loc or iloc:
print (df)
   0  A    B  C
0  1  2  NaN  8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
jezrael's response is spot on. If you are only concerned with whether there is any NaN value at all, I was exploring whether there's a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN values in a specific column, you can use
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the dataframe, you may do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
indexes.append(list(i))
And if you are looking for a one-liner, you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
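A shorter way to get the same row/column positions, assuming a reasonably recent pandas (0.21+ for .isna()), is to run numpy.argwhere on the boolean mask:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [np.nan, 4], 'C': [8, np.nan]})

# each row of the result is a [row position, column position] pair of a NaN
print(np.argwhere(df.isna().values))
# [[0 1]
#  [1 2]]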
Given a dataframe a with 3 columns, A, B, C, and 3 rows of numerical values, how does one sort all the rows with a comparison operator using only the product A[i]*B[i]? It seems that the pandas sort only takes columns and then a sort method.
I would like to use a comparison function like below.
f = lambda i,j: a['A'][i]*a['B'][i] < a['A'][j]*a['B'][j]
There are at least two ways:
Method 1
Say you start with
In [175]: df = pd.DataFrame({'A': [1, 2], 'B': [1, -1], 'C': [1, 1]})
You can add a column which is your sort key
In [176]: df['sort_val'] = df.A * df.B
Finally sort by it and drop it
In [190]: df.sort_values('sort_val').drop(columns='sort_val')
Out[190]:
   A  B  C
1  2 -1  1
0  1  1  1
Method 2
Use numpy.argsort and then use .iloc on the resulting (positional) indices:
In [197]: import numpy as np
In [198]: df.iloc[np.argsort(df.A * df.B).values]
Out[198]:
   A  B  C
0  1  1  1
1  2 -1  1
Another way, adding it here because this is the first result at Google:
df.loc[(df.A * df.B).sort_values().index]
This works well for me and is pretty straightforward. @Ami Tavory's answer gave strange results for me with a categorical index; I'm not sure whether that was the cause, though.
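On pandas 1.1 or newer, the key argument of sort_values can express the same idea without a helper column; a sketch (the by column is only a placeholder here, since the key actually computes the A*B product):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [1, -1], 'C': [1, 1]})

# key receives the 'A' column as a Series; returning A*B makes pandas
# sort the rows by that product instead of by A itself
print(df.sort_values('A', key=lambda s: s * df['B']))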
Just adding, on top of @srs's super elegant answer, an iloc option with some time comparisons against loc and the naive solution.
(iloc is for position-based indexing, vs label-based indexing for loc; it gives the same rows here only because the index is the default RangeIndex, where positions and labels coincide.)
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({
'A': np.random.randint(low=1, high=N, size=N),
'B': np.random.randint(low=1, high=N, size=N)
})
%%timeit -n 100
df['C'] = df['A'] * df['B']
df.sort_values(by='C')
naive: 100 loops, best of 3: 1.85 ms per loop
%%timeit -n 100
df.loc[(df.A * df.B).sort_values().index]
loc: 100 loops, best of 3: 2.69 ms per loop
%%timeit -n 100
df.iloc[(df.A * df.B).sort_values().index]
iloc: 100 loops, best of 3: 2.02 ms per loop
df['C'] = df['A'] * df['B']
df1 = df.sort_values(by='C')
df2 = df.loc[(df.A * df.B).sort_values().index]
df3 = df.iloc[(df.A * df.B).sort_values().index]
print(np.array_equal(df1.index, df2.index))
print(np.array_equal(df2.index, df3.index))
testing results (comparing the entire index order) between all options:
True
True
I have a pandas dataframe in which I want to replace the values of a certain column conditionally.
eg:
    col
0    Mr
1  Miss
2    Mr
3   Mrs
4  Col.
I want to map them as
{'Mr': 0, 'Mrs': 1, 'Miss': 2}
If there are other titles not available in the dict, then I want them to have a default value of 3.
The above example becomes
   col
0    0
1    2
2    0
3    1
4    3
Can I do this with pandas.replace() without using regex?
You can use map rather than replace, because it is faster; then fillna with 3 and cast to int with astype:
df['col'] = df.col.map({'Mr': 0, 'Mrs': 1, 'Miss': 2}).fillna(3).astype(int)
print (df)
   col
0    0
1    2
2    0
3    1
4    3
Another solution with numpy.where and a condition using isin:
d = {'Mr': 0, 'Mrs': 1, 'Miss': 2}
df['col'] = np.where(df.col.isin(d.keys()), df.col.map(d), 3).astype(int)
print (df)
   col
0    0
1    2
2    0
3    1
4    3
Solution with replace:
d = {'Mr': 0, 'Mrs': 1, 'Miss': 2}
df['col'] = np.where(df.col.isin(d.keys()), df.col.replace(d), 3)
print (df)
   col
0    0
1    2
2    0
3    1
4    3
Timings:
df = pd.concat([df]*10000).reset_index(drop=True)
d = {'Mr': 0, 'Mrs': 1, 'Miss': 2}
df['col0'] = df.col.map(d).fillna(3).astype(int)
df['col1'] = np.where(df.col.isin(d.keys()), df.col.replace(d), 3)
df['col2'] = np.where(df.col.isin(d.keys()), df.col.map(d), 3).astype(int)
print (df)
In [447]: %timeit df['col0'] = df.col.map(d).fillna(3).astype(int)
100 loops, best of 3: 4.93 ms per loop
In [448]: %timeit df['col1'] = np.where(df.col.isin(d.keys()), df.col.replace(d), 3)
100 loops, best of 3: 14.3 ms per loop
In [449]: %timeit df['col2'] = np.where(df.col.isin(d.keys()), df.col.map(d), 3).astype(int)
100 loops, best of 3: 7.68 ms per loop
In [450]: %timeit df['col3'] = df.col.map(lambda L: d.get(L, 3))
10 loops, best of 3: 36.2 ms per loop
To add to the answer by @jezrael: the most straightforward solution is to use a defaultdict instead of a dict. This is especially useful when you want missing (NaN) values not to be replaced with your default value.
from collections import defaultdict
df['col'] = df.col.map(defaultdict(lambda: 3, Mr=0, Mrs=1, Miss=2), na_action='ignore')
The first argument of defaultdict is a function that returns the default value.
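A quick check of the behaviour on a small Series with an unseen title and a missing value (the 'Dr' entry is just an illustrative value, not part of the original question):
import numpy as np
import pandas as pd
from collections import defaultdict

s = pd.Series(['Mr', 'Dr', np.nan])
mapping = defaultdict(lambda: 3, Mr=0, Mrs=1, Miss=2)

# 'Dr' is not a key, so the defaultdict supplies the default 3;
# the NaN is skipped because of na_action='ignore' and stays NaN
print(s.map(mapping, na_action='ignore'))
# 0    0.0
# 1    3.0
# 2    NaN
# dtype: float64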