Adding dataframe columns together, separated by commas, considering NaNs - python

How could NaN values be completely omitted from the new column in order to avoid consecutive commas?
df['newcolumn'] = df.apply(','.join, axis=1)
One approach would probably be to use a conditional lambda:
df.apply(lambda x: ','.join(x.astype(str)) if(np.isnan(x.astype(str))) else '', axis = 1)
But this returns an error message:
TypeError: ("ufunc 'isnan' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''", 'occurred at index 0')
Edit:
Both your answers work. What criteria would I use to determine which one to use? Performance considerations?

You can use stack, since it removes NaN values by default:
df.stack().groupby(level=0).apply(','.join)
Out[552]:
0 a,t,y
1 a,t
2 a,u,y
3 a,u,n
4 a,u
5 b,t,y
dtype: object
Data input
df
Out[553]:
Mary John David
0 a t y
1 a t NaN
2 a u y
3 a u n
4 a u NaN
5 b t y
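For reference, a minimal runnable reproduction of the setup above (the frame is rebuilt by hand from the printout, so treat the literals as assumptions):
import numpy as np
import pandas as pd

# Rebuild the sample frame from the printout above.
df = pd.DataFrame({'Mary': ['a', 'a', 'a', 'a', 'a', 'b'],
                   'John': ['t', 't', 'u', 'u', 'u', 't'],
                   'David': ['y', np.nan, 'y', 'n', np.nan, 'y']})

# stack() drops NaN by default; grouping on level 0 restores the rows.
print(df.stack().groupby(level=0).apply(','.join))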

You can use dropna inside your apply, such as:
df.apply(lambda x: ','.join(x.dropna()), axis = 1)
Using Wen's input for df, if you compare on a small df, this one is slightly faster:
%timeit df.apply(lambda x: ','.join(x.dropna()),1)
1000 loops, best of 3: 1.04 ms per loop
%timeit df.stack().groupby(level=0).apply(','.join)
1000 loops, best of 3: 1.6 ms per loop
but for a bigger dataframe, Wen's answer is way faster:
df_long = pd.concat([df]*1000)
%timeit df_long.apply(lambda x: ','.join(x.dropna()),1)
1 loop, best of 3: 850 ms per loop
%timeit df_long.stack().groupby(level=0).apply(','.join)
100 loops, best of 3: 13.1 ms per loop
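If you want to reproduce the crossover yourself, a small harness like the following should work (a sketch; note that ignore_index=True keeps the row labels unique, which the level-0 groupby relies on):
import timeit

df_long = pd.concat([df] * 1000, ignore_index=True)

t_apply = timeit.timeit(
    lambda: df_long.apply(lambda x: ','.join(x.dropna()), axis=1), number=3)
t_stack = timeit.timeit(
    lambda: df_long.stack().groupby(level=0).apply(','.join), number=3)
print('apply: %.3fs  stack: %.3fs' % (t_apply, t_stack))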

Conditionally assign values to DF column using np.where

I'm trying to group some columns into a list inside a single column.
In case one of these columns contains NaN, the result column should be just NaN instead of the list.
df = pd.DataFrame({'a.0':[11, 1.1], 'a.5':[12, 1.2], 'a.10':[13, pd.np.NaN]})
The result column of the DF should be as such:
    a.0  a.10   a.5        result
0  11.0  13.0  12.0  [11, 13, 12]
1   1.1   NaN   1.2           NaN
These 2 lines do the job:
df['result'] = df[['a.0','a.10','a.5']].values.tolist()
df['result'] = pd.np.where(df[['a.0','a.10','a.5']].isnull().any(axis=1), pd.np.nan, df['result'])
And I was wondering how to do it in one line. Help would be appreciated.
Using isnull + any + mask/where
df['result'] = pd.Series(df.values.tolist()).mask(df.isnull().any(1))
Or,
df['result'] = pd.Series(df.values.tolist()).where(~df.isnull().any(1))
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
Performance
df = pd.concat([df] * 100000, ignore_index=True)
# Wen's solution
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
1 loop, best of 3: 1min 37s per loop
# Anton vBR's solution
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
1 loop, best of 3: 5.79 s per loop
# my answer
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
10 loops, best of 3: 133 ms per loop
Conclusions
apply is inefficient, even more so than a plain loop: an apply call carries a lot of overhead, so a pure Python loop is generally faster.
A vectorised approach such as where/mask operates on whole columns at once and offers far better performance than loopy solutions.
Update: with timings on large data-sets, cᴏʟᴅsᴘᴇᴇᴅ's answer is the best. List comprehensions always suffer here. I've updated my previous answer with timings.
You could use itertuples and assign np.nan if there is any np.nan in the row:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a.0': np.random.choice(np.append(np.nan, np.arange(10)), 1000),
                   'a.5': np.random.choice(10, 1000),
                   'a.10': np.random.choice(10, 1000)})
# 3 solutions to solve the problem
# Assign with df['result'] =
%timeit [np.nan if np.isnan(v).any() else list(v[1:]) for v in df.itertuples()]
%timeit pd.Series(df.values.tolist()).mask(df.isnull().any(1))
%timeit df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
Timings:
100 loops, best of 3: 8.38 ms per loop
1000 loops, best of 3: 772 µs per loop
1 loop, best of 3: 214 ms per loop
df['result']=df.apply(lambda x : pd.Series([x.tolist()]) if ~x.isnull().any() else np.nan,1)
df
Out[30]:
a.0 a.10 a.5 result
0 11.0 13.0 12.0 [11.0, 13.0, 12.0]
1 1.1 NaN 1.2 NaN
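For completeness, an end-to-end sketch of the accepted mask approach on the question's frame (np.nan replaces the deprecated pd.np alias; everything else follows the answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a.0': [11, 1.1], 'a.5': [12, 1.2], 'a.10': [13, np.nan]})

# Build a list per row, then blank out rows that contain any NaN.
# (Modern pandas keeps dict insertion order, so column order may
# differ from the printout above.)
df['result'] = pd.Series(df.values.tolist(), index=df.index).mask(df.isnull().any(axis=1))
print(df)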

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas.
0  A  B    C
1  2  NaN  8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnan() and I get a table like this:
0  A      B     C
1  false  true  false
but I am not sure how to index the table, or whether this is an efficient way of performing the job at all.
Use pd.isnull; to select the cell, use loc or iloc:
print (df)
0 A B C
0 1 2 NaN 8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
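As a side note, for a single cell the scalar accessor at may be marginally quicker than loc (a minor variation, not something the approach above depends on):
# .at is the scalar-optimized counterpart of .loc
a = pd.isnull(df.at[0, 'B'])
print (a)
True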
jezrael's response is spot on. But if you are only concerned with whether there is any NaN value at all, I explored whether there's a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = pd.np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN values in a specific column, you can use:
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the dataframe, you may do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one-liner, you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
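A simpler equivalent for collecting all (row, column) positions of NaNs, assuming the frame is numeric so np.isnan applies, is np.argwhere:
import numpy as np

# Each result row is a [row_index, col_index] pair pointing at a NaN cell.
nan_positions = np.argwhere(np.isnan(df.values))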

Pandas DataFrame: How to quickly get first non-NaN value in each row? [duplicate]

If I've got a DataFrame in pandas which looks something like:
A B C
0 1 NaN 2
1 NaN 3 NaN
2 NaN 4 5
3 NaN NaN NaN
How can I get the first non-null value from each row? E.g. for the above, I'd like to get: [1, 3, 4, None] (or an equivalent Series).
Back-fill the NaNs along each row with fillna, then take the leftmost column:
df.fillna(method='bfill', axis=1).iloc[:, 0]
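In recent pandas the method= keyword of fillna is deprecated; an equivalent sketch, assuming a modern pandas version:
# bfill(axis=1) shifts each row's next valid value leftward,
# so column 0 ends up holding the first non-null entry per row.
df.bfill(axis=1).iloc[:, 0]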
This is a really messy way to do it: first use first_valid_index to get the valid columns, convert the returned series to a dataframe so we can call apply row-wise, then use this to index back into the original df:
In [160]:
def func(x):
    if x.values[0] is None:
        return None
    else:
        return df.loc[x.name, x.values[0]]
pd.DataFrame(df.apply(lambda x: x.first_valid_index(), axis=1)).apply(func,axis=1)
Out[160]:
0 1
1 3
2 4
3 NaN
dtype: float64
EDIT
A slightly cleaner way:
In [12]:
def func(x):
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df.apply(func, axis=1)
Out[12]:
0 1
1 3
2 4
3 NaN
dtype: float64
I'm going to weigh in here as I think this is a good deal faster than any of the proposed methods. argmin gives the index of the first False value in each row of the result of np.isnan in a vectorized way, which is the hard part. It still relies on a Python loop to extract the values, but the lookup is very quick:
def get_first_non_null(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return [a[row, col] for row, col in enumerate(col_index)]
EDIT:
Here's a fully vectorized solution which can be a good deal faster again, depending on the shape of the input. Updated benchmarking below.
def get_first_non_null_vec(df):
    a = df.values
    n_rows, n_cols = a.shape
    col_index = np.isnan(a).argmin(axis=1)
    flat_index = n_cols * np.arange(n_rows) + col_index
    return a.ravel()[flat_index]
If a row is completely null then the corresponding value will be null also.
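As a quick sanity check on the question's frame (df as defined at the top of this question), both helpers return the expected [1, 3, 4, NaN]:
get_first_non_null(df)      # [1.0, 3.0, 4.0, nan]
get_first_non_null_vec(df)  # array([ 1.,  3.,  4., nan])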
Here's some benchmarking against unutbu's solution:
df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 220 ms per loop
100 loops, best of 3: 16.2 ms per loop
100 loops, best of 3: 12.6 ms per loop
In [109]:
df = pd.DataFrame(np.random.choice([1, np.nan], (100000, 150), p=(0.01, 0.99)))
#%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 246 ms per loop
10 loops, best of 3: 48.2 ms per loop
100 loops, best of 3: 15.7 ms per loop
df = pd.DataFrame(np.random.choice([1, np.nan], (1000000, 15), p=(0.01, 0.99)))
%timeit df.stack().groupby(level=0).first().reindex(df.index)
%timeit get_first_non_null(df)
%timeit get_first_non_null_vec(df)
1 loops, best of 3: 326 ms per loop
1 loops, best of 3: 326 ms per loop
10 loops, best of 3: 35.7 ms per loop
Here is another way to do it:
In [183]: df.stack().groupby(level=0).first().reindex(df.index)
Out[183]:
0 1
1 3
2 4
3 NaN
dtype: float64
The idea here is to use stack to move the columns into a row index level:
In [184]: df.stack()
Out[184]:
0 A 1
C 2
1 B 3
2 B 4
C 5
dtype: float64
Now, if you group by the first row level -- i.e. the original index -- and take the first value from each group, you essentially get the desired result:
In [185]: df.stack().groupby(level=0).first()
Out[185]:
0 1
1 3
2 4
dtype: float64
All we need to do is reindex the result (using the original index) so as to
include rows that are completely NaN:
df.stack().groupby(level=0).first().reindex(df.index)
This is nothing new, but it's a combination of the best bits of yangie's approach with a list comprehension and EdChum's df.apply approach, which I think is easiest to understand.
First, which columns do we want to pick our values from?
In [95]: pick_cols = df.apply(pd.Series.first_valid_index, axis=1)
In [96]: pick_cols
Out[96]:
0 A
1 B
2 B
3 None
dtype: object
Now how do we pick the values?
In [100]: [df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()]
Out[100]: [1.0, 3.0, 4.0, None]
This is ok, but we really want the index to match that of the original DataFrame:
In [98]: pd.Series({k:df.loc[k, v] if v is not None else None
....: for k, v in pick_cols.iteritems()})
Out[98]:
0 1
1 3
2 4
3 NaN
dtype: float64
groupby on axis=1
If we pass a callable that returns the same value for every column, we group all the columns together. This lets us use groupby's first method, which makes this easy:
df.groupby(lambda x: 'Z', 1).first()
Z
0 1.0
1 3.0
2 4.0
3 NaN
This returns a dataframe whose column name is the value returned by the callable.
lookup, notna, and idxmax
df.lookup(df.index, df.notna().idxmax(1))
array([ 1., 3., 4., nan])
argmin and slicing
v = df.values
v[np.arange(len(df)), np.isnan(v).argmin(1)]
array([ 1., 3., 4., nan])
Here is a one-line solution:
[row[row.first_valid_index()] if row.first_valid_index() is not None else None for _, row in df.iterrows()]
Edit:
This solution iterates over the rows of df. row.first_valid_index() returns the label of the first non-NA/null value, which is used as the index to get the first non-null item in each row.
If there is no non-null value in the row, row.first_valid_index() is None, which cannot be used as an index, hence the if-else.
I packed everything into a list comprehension for brevity.
JoeCondron's answer (EDIT: before his last edit!) is cool but there is margin for significant improvement by avoiding the non-vectorized enumeration:
def get_first_non_null_vect(df):
    a = df.values
    col_index = np.isnan(a).argmin(axis=1)
    return a[np.arange(a.shape[0]), col_index]
The improvement is small if the DataFrame is relatively flat:
In [4]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 1500), p=(0.01, 0.99)))
In [5]: %timeit get_first_non_null(df)
10 loops, best of 3: 34.9 ms per loop
In [6]: %timeit get_first_non_null_vect(df)
10 loops, best of 3: 31.6 ms per loop
... but can be relevant on slim DataFrames:
In [7]: df = pd.DataFrame(np.random.choice([1, np.nan], (10000, 15), p=(0.1, 0.9)))
In [8]: %timeit get_first_non_null(df)
100 loops, best of 3: 3.75 ms per loop
In [9]: %timeit get_first_non_null_vect(df)
1000 loops, best of 3: 718 µs per loop
Compared to JoeCondron's vectorized version, the runtime is very similar (this is still slightly quicker for slim DataFrames, and slightly slower for large ones).
import numpy
import pandas

df = pandas.DataFrame({'A': [1, numpy.nan, numpy.nan, numpy.nan],
                       'B': [numpy.nan, 3, 4, numpy.nan],
                       'C': [2, numpy.nan, 5, numpy.nan]})
df
A B C
0 1.0 NaN 2.0
1 NaN 3.0 NaN
2 NaN 4.0 5.0
3 NaN NaN NaN
df.apply(lambda x: numpy.nan if all(x.isnull()) else x[x.first_valid_index()], axis=1).tolist()
[1.0, 3.0, 4.0, nan]

Pandas DataFrame filtering

Let's say I have a DataFrame with four columns, each of which has a threshold value against which I'd like to compare the DataFrame's values.
For each element, I would simply like the smaller of the DataFrame's value and the corresponding threshold.
For example:
df = pd.DataFrame(np.random.randn(100,4), columns=list('ABCD'))
>>> df.head()
A B C D
0 -2.060410 -1.390896 -0.595792 -0.374427
1 0.660580 0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210 1.711597 -0.063274
4 1.217197 0.202063 -1.407561 0.940371
thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})
This solution works (A4 and C3 were filtered), but there must be an easier way:
df_filtered = df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
>>> df_filtered.head()
A B C D
0 -2.060410 -1.390896 -0.595792 -0.374427
1 0.660580 0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210 1.200000 -0.063274
4 1.000000 0.202063 -1.407561 0.940371
Ideally, I'd like to use .loc to filter in place, but I haven't managed to figure it out. I'm using Pandas 0.14.1 (and can't upgrade).
Response: Below are the timed tests of my initial proposal against the alternatives:
%%timeit
df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
1000 loops, best of 3: 990 µs per loop
%%timeit
np.minimum(df, thresholds) # <--- Simple, fast, and returns DataFrame!
10000 loops, best of 3: 110 µs per loop
%%timeit
df[df < thresholds].fillna(thresholds)
1000 loops, best of 3: 1.36 ms per loop
This is pretty fast (and returns a dataframe):
np.minimum(df, [1.0, 1.1, 1.2, 1.3])
A pleasant surprise that numpy is so amenable to this without any reshaping or explicit conversions...
How about:
df_filtered = df[df < thresholds].fillna(thresholds)
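On newer pandas versions (not an option for the asker stuck on 0.14.1), clip accepts per-column bounds directly, which may be the cleanest spelling; a sketch assuming a modern pandas:
# axis=1 aligns the thresholds Series with the columns,
# so each column is capped at its own threshold.
df_filtered = df.clip(upper=thresholds, axis=1)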

pandas: conditional count across row

I have a dataframe that has months for columns, and various departments for rows.
2013April 2013May 2013June
Dep1 0 10 15
Dep2 10 15 20
I'm looking to add a column that counts the number of months that have a value greater than 0. Ex:
2013April 2013May 2013June Count>0
Dep1 0 10 15 2
Dep2 10 15 20 3
The number of columns this function needs to span is variable. I think defining a function then using .apply is the solution, but I can't seem to figure it out.
First, pick your columns, cols:
df[cols].apply(lambda s: (s > 0).sum(), axis=1)
This takes advantage of the fact that True and False are 1 and 0, respectively, in Python.
Actually, there's a better way:
(df[cols] > 0).sum(1)
because this takes advantage of NumPy vectorization:
%timeit df.apply(lambda s: (s > 0).sum(), axis=1)
10 loops, best of 3: 141 ms per loop
%timeit (df > 0).sum(1)
1000 loops, best of 3: 319 µs per loop
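Putting it together on the question's frame (a sketch; the frame is rebuilt from the printout above):
import pandas as pd

df = pd.DataFrame({'2013April': [0, 10], '2013May': [10, 15], '2013June': [15, 20]},
                  index=['Dep1', 'Dep2'])

cols = df.columns  # or any subset of month columns
df['Count>0'] = (df[cols] > 0).sum(axis=1)
print(df)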
