python pandas - input values into new column

I have a small dataframe below of the spending of 4 people.
There is an empty column called 'Grade'.
I would like to give grade A to those who spent more than $100 and grade B to those who spent less.
What is the most efficient way to fill the 'Grade' column, assuming the real dataframe is big?
import pandas as pd
df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
df['Grade'] = ''

You can use numpy.where:
import numpy as np
df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
print (df)
  Customer  Spending Grade
0      Bob       130     A
1      Ken        22     B
2    Steve       313     A
3      Joe        46     B
Timings:
df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})
#[400000 rows x 4 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
In [129]: %timeit df['Grade']= np.where(df['Spending'] > 100 ,'A','B')
10 loops, best of 3: 21.6 ms per loop
In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)
1 loop, best of 3: 7.08 s per loop
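If more than two grade bands are ever needed, the same idea generalizes with np.select; a minimal sketch (the extra $200 threshold and the 'C' grade are invented for illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# conditions are checked in order; the first match wins, 'C' is the fallback
conditions = [df['Spending'] > 200, df['Spending'] > 100]
choices = ['A', 'B']
df['Grade'] = np.select(conditions, choices, default='C')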

A lambda with apply is another option, although, as the timings above show, it is much slower than np.where on a large dataframe:
df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis = 1)

Related

Adding column of weights to pandas DF by a condition on the DFs column

What's the most pythonic way to add a column (of weights) to an existing Pandas DataFrame "df" based on a condition on df's column?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
   A  B
0  1  4
1  2  5
2  3  6
I'd like to add a "weight" column where df['weight'] = 20 if df['B'] >= 6, else df['weight'] = 1.
So my output will be:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Approach #1
Here's one with type-conversion and scaling - the boolean mask casts to 0/1, so True*19 + 1 gives 20 and False*19 + 1 gives 1 -
df['weight'] = (df['B'] >= 6)*19+1
Approach #2
Another, possibly faster, one using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multiple cores with the numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (a size mentioned by the OP) with random data in the range [0, 9), for the vectorized methods posted so far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one...
Since the output is either 1 or 20, we can safely use a lower-precision dtype, uint8, for a further speedup over the approaches already discussed, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
   A  B  weight
0  1  4       1
1  2  5       1
2  3  6      20

Using list comprehension to number dataframes in a list of dataframes

I have a list of 4 dataframes, called df.
I'd like to add a "number" column to each dataframe (df[i]['number']) that represent the dataframe number.
I tried to use list comprehension for that:
df=[df['number']=(x+1) for x in range(0,4)]
Which resulted in
File "<ipython-input-52-0b708f543fbb>", line 1
df=[df['number']=(x+1) for x in range(0,4)]
^
SyntaxError: invalid syntax
I also tried:
df=[x['number']=(y+1) for x,y in enumerate(df)]
With the same result, pointing at the '=' sign.
What am I doing wrong?
Use enumerate, starting from 1 and assign to each dataframe in your list.
for i, d in enumerate(df, 1):
    d['number'] = i
In-place assignment is much cheaper than assignment in a list comprehension.
df[0]
   id  marks
0   1    100
1   2    200
2   3    300
df[1]
   name  score flag
0  'abc'    100    T
1  'zxc'    300    F
for i, d in enumerate(df, 1):
    d['number'] = i
df[0]
   id  marks  number
0   1    100       1
1   2    200       1
2   3    300       1
df[1]
   name  score flag  number
0  'abc'    100    T       2
1  'zxc'    300    F       2
Performance
Small
1000 loops, best of 3: 278 µs per loop # mine
vs
1000 loops, best of 3: 567 µs per loop # John Galt
Large (df * 10000)
1000 loops, best of 3: 607 µs per loop # mine
vs
1000 loops, best of 3: 1.16 ms per loop # John Galt - assign
1 loop, best of 1: 1.42 ms per loop # John Galt - side effects
Note that the loop-based assignment is also space efficient.
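The space argument is easy to check: assign always returns a brand-new DataFrame, while the loop mutates the objects already in the list. A quick sketch (the names dfs and dff are just for illustration):
import numpy as np
import pandas as pd

dfs = [pd.DataFrame(np.random.rand(2, 2)) for _ in range(4)]

# in-place loop: mutates the existing DataFrames, no copies are made
for i, d in enumerate(dfs, 1):
    d['number'] = i

# assign-based comprehension: every element is a new DataFrame
dff = [x.assign(number=i) for i, x in enumerate(dfs, 1)]
print(dff[0] is dfs[0])   # False - assign returned a copy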
Use
1)
In [454]: df = [x.assign(number=i) for i, x in enumerate(df, 1)]
In [455]: df[0]
Out[455]:
          0         1  number
0  0.068330  0.708835       1
1  0.877747  0.586654       1
In [456]: df[1]
Out[456]:
          0         1  number
0  0.430418  0.477923       2
1  0.049980  0.018981       2
The good part is that you can assign it to a new variable without altering the old list, like:
dff = [x.assign(number=i) for i, x in enumerate(df, 1)]
2)
If you want it done in place while still using a list comprehension:
In [474]: [x.insert(x.shape[1] ,'number', i) for i, x in enumerate(df, 1)]
Out[474]: [None, None, None, None]
In [475]: df[0]
Out[475]:
          0         1  number
0  0.207806  0.315701       1
1  0.464864  0.976156       1

pandas rolling max with groupby

I have a problem getting the rolling function of Pandas to do what I wish. I want to calculate, for each row, the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
   id  value
0   1      3
1   1      6
2   1      3
3   2      2
4   2      1
Now I wish to obtain the following DataFrame:
   id  value
0   1      3
1   1      6
2   1      6
3   2      2
4   2      2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with NaNs. Can someone explain how to properly use rolling, or some other Pandas function, to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
   id  value  new
0   1      3    3
1   1      6    6
2   1      3    6
3   2      2    2
4   2      1    2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
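If you specifically want the window-style spelling the question was reaching for, expanding() is the growing-window counterpart of cummax() and should give the same per-group result; a sketch (the 'new2' column name is just for illustration):
import pandas as pd

df = pd.DataFrame([[1, 3], [1, 6], [1, 3], [2, 2], [2, 1]], columns=['id', 'value'])

# expanding().max() uses a window that grows from the start of each group;
# the groupby result carries the group key as an extra index level, so drop it
df['new2'] = (df.groupby('id')['value']
                .expanding().max()
                .reset_index(level=0, drop=True))
print(df)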
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnan() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index the table and if this is an efficient way of performing the job at all?
Use pd.isnull; for selection use loc or iloc:
print (df)
   0  A    B  C
0  1  2  NaN  8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
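As a side note, pd.isnull is more forgiving than np.isnan about what it accepts, which matters for object columns; a small sketch:
import numpy as np
import pandas as pd

print(pd.isnull(np.nan))   # True
print(pd.isnull(None))     # True
# np.isnan(None) would raise TypeError, since np.isnan only accepts numeric input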
jezrael's response is spot on. If you are only concerned with NaN values, I was exploring whether there's a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN values in a specific column, you can use:
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the dataframe, you can do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one-liner, you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
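A possibly shorter equivalent is to run np.argwhere over the boolean mask; a sketch (nan_positions is just an illustrative name):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [np.nan, 8]})

# each entry is a [row, column] position of a NaN cell
nan_positions = np.argwhere(df.isnull().values).tolist()
print(nan_positions)   # [[0, 1]]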

pandas: conditional count across row

I have a dataframe that has months for columns, and various departments for rows.
      2013April  2013May  2013June
Dep1          0       10        15
Dep2         10       15        20
I'm looking to add a column that counts the number of months that have a value greater than 0. Ex:
      2013April  2013May  2013June  Count>0
Dep1          0       10        15        2
Dep2         10       15        20        3
The number of columns this function needs to span is variable. I think defining a function then using .apply is the solution, but I can't seem to figure it out.
First, pick your columns, cols:
df[cols].apply(lambda s: (s > 0).sum(), axis=1)
This takes advantage of the fact that True and False are 1 and 0 respectively in Python.
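For instance, a quick illustration of that boolean arithmetic:
import pandas as pd

print(sum([True, False, True]))             # 2
print((pd.Series([0, 10, 15]) > 0).sum())   # 2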
Actually, there's a better way:
(df[cols] > 0).sum(1)
because this takes advantage of NumPy vectorization:
%timeit df.apply(lambda s: (s > 0).sum(), axis=1)
10 loops, best of 3: 141 ms per loop
%timeit (df > 0).sum(1)
1000 loops, best of 3: 319 µs per loop
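To get the exact output shown in the question, the count can be assigned back as a new column; a small sketch using the equivalent .gt() method form (the 'Count>0' name comes from the question):
import pandas as pd

df = pd.DataFrame({'2013April': [0, 10], '2013May': [10, 15], '2013June': [15, 20]},
                  index=['Dep1', 'Dep2'])

# count, per row, how many month columns are greater than zero
df['Count>0'] = df.gt(0).sum(axis=1)
print(df)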
