Python pandas - remove groups based on NaN count threshold

I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove 'stationID' groups that have more than a certain number of NaNs. For instance, if I type:
**>>> df.groupby('stationID')**
then I would like to drop groups that have at least a certain number of NaNs (say 30). As I understand it, I cannot use dropna(thresh=30) with groupby:
**>>> df2.groupby('station').dropna(thresh=30)**
*AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...*
So, what would be the best way to do that with Pandas?

IIUC you can do df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)]) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
3 1 1.0
4 1 NaN
5 1 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)] ) < 2).index]
Out[64]:
id val
0 0 1.0
1 0 1.0
2 0 NaN
6 2 1.0
7 2 1.0
8 2 1.0
9 2 1.0
So this filters out the groups that have more than 1 NaN value.
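An alternative sketch, if you prefer to keep or drop whole groups directly on the DataFrame (assuming the NaNs of interest are in the 'Temperature' column and the threshold is 30):
# keep only stations with at most 30 missing temperatures; all rows of other stations are dropped
kept = df.groupby('stationID').filter(lambda g: g['Temperature'].isnull().sum() <= 30)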

You can create a column giving the number of null values per stationID, and then use loc to select the relevant data for further processing.
df['station_id_null_count'] = \
df.groupby('stationID').Temperature.transform(lambda group: group.isnull().sum())
df.loc[df.station_id_null_count <= 30, :]  # keep stations with at most 30 NaNs

Using @EdChum's setup: since you don't mention your final output, adding this.
vals = df.groupby(['id'])['val'].apply(lambda x: (np.size(x) - x.count()) < 2)
vals[vals]
id
0 True
2 True
Name: val, dtype: bool
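If you then want the surviving rows rather than the per-group booleans, one way (a sketch) is to index back into df with the ids that passed:
df[df['id'].isin(vals[vals].index)]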

Related

Converting repeating rows to columns in pandas dataframe [duplicate]

I am trying to convert a dataframe with repeating rows into columns as follows
INPUT
Key | Value
A | 1
B | 2
C | 3
A | 4
B | 5
C | 6
EXPECTED OUTPUT
A | B | C
1 | 2 | 3
4 | 5 | 6
There are a lot of options like pivot(), unstack(), groupby(), etc., but I was unsure how to use them with just the 2 columns shown in the input.
It's not a straightforward pivot. Do this using df.pivot with df.apply and Series.dropna:
In [747]: x = df.pivot(index=None, columns='Key', values='Value').apply(lambda x: pd.Series(x.dropna().to_numpy()))
In [748]: x
Out[748]:
Key A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
Explanation:
Let's break it down:
First you pivot your df like this:
In [751]: y = df.pivot(index=None, columns='Key', values='Value')
In [752]: y
Out[752]:
Key A B C
0 1.0 NaN NaN
1 NaN 2.0 NaN
2 NaN NaN 3.0
3 4.0 NaN NaN
4 NaN 5.0 NaN
5 NaN NaN 6.0
Now we are close to your expected output, but we need to remove the NaNs and collapse the 6 rows into 2 rows.
For that, we dropna() each column and rebuild it as a fresh pd.Series, so the surviving values take positions 0, 1, ...:
In [753]: y.apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[753]:
Key A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
This is your final output.
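As a side note (not part of the answer above), an alternative sketch is to build an explicit row counter per key with groupby().cumcount() and pivot on that, which avoids the intermediate NaN-filled frame; it assumes every key occurs the same number of times:
out = (df.assign(row=df.groupby('Key').cumcount())
         .pivot(index='row', columns='Key', values='Value'))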

How to filter all the rows that contain "isolated" NaN values in a column in Python

I have a column in a pandas dataframe where some of the rows have NaN values.
I would like to select the rows that satisfy these conditions:
- they are NaN values;
- they are not directly preceded or followed by another NaN (i.e. the NaN is isolated between non-null values)
For example, I would like to select the row containing this NaN value:
input:
index | Col
...
1 | 1344
2 | NaN
3 | 532
...
desired output:
2 | NaN
But I don't want to select these NaN values (as they are followed by another NaN value or come right after one):
index | Col
...
1 | 1344
2 | NaN
3 | NaN
4 | 532
...
Any help would be much appreciated
Thank you!
Below I show you how to do it with an example. Series.notna + Series.cumsum is used with groupby to group each run of NaNs together with the non-null value that precedes it. transform('size') then gives every row the size of its group; a size of at most 2 means the group holds at most one NaN, i.e. an isolated one. ANDing this boolean mask with df['col2'].isna() gives the mask we need for boolean indexing, selecting the rows that are NaN but not part of a consecutive run of NaNs.
df=pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],'col2':[np.nan,2,3,np.nan,np.nan,6,np.nan,8,9,np.nan]})
print(df)
col1 col2
0 1 NaN
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN
5 6 6.0
6 7 NaN
7 8 8.0
8 9 9.0
9 10 NaN
mask_repeat_NaN=df.groupby(df['col2'].notna().cumsum())['col2'].transform('size').le(2)
mask=mask_repeat_NaN&df['col2'].isna()
df_filtered=df[mask]
print(df_filtered)
col1 col2
0 1 NaN
6 7 NaN
9 10 NaN
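A shift-based alternative sketch (assuming pandas >= 0.24 for the fill_value argument): a NaN is isolated when neither of its immediate neighbours is NaN, with the missing neighbour at each end of the column treated as non-null:
prev_ok = df['col2'].shift(1, fill_value=0).notna()   # row above is not NaN (first row counts as ok)
next_ok = df['col2'].shift(-1, fill_value=0).notna()  # row below is not NaN (last row counts as ok)
print(df[df['col2'].isna() & prev_ok & next_ok])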

Calculating rolling retention with Python [duplicate]

I am having trouble calculating rolling retention.
I was trying to figure out how to make groupby work, but it seems to suit only the classic retention calculation.
Rolling retention - count the number of users from each group who logged in during the given month OR later.
data = {'id':[1, 1, 1, 2, 2, 2, 2, 3, 3],
'group_month': ['2013-05', '2013-05', '2013-05', '2013-06', '2013-06', '2013-06', '2013-06', '2013-06', '2013-06'],
'login_month': ['2013-05', '2013-06', '2013-07', '2013-06', '2013-07', '2013-09', '2013-10', '2013-09', '2013-10']}
Transforming data:
data = pd.DataFrame(data)
# assign the conversions back (to_datetime does not modify the frame in place)
data['group_month'] = pd.to_datetime(data['group_month'], format='%Y-%m', errors='coerce')
data['login_month'] = pd.to_datetime(data['login_month'], format='%Y-%m', errors='coerce')
To calculate classic retention (count users from each cohort who logged in during the exact month), I used the following code:
classic_ret = pd.DataFrame(data[(data['login_month'] >= data['group_month'])].groupby(['group_month', 'login_month'])['id'].count())
classic_ret.unstack()
Rolling retention should have the following output:
+-------------+---------+---------+---------+---------+---------+---------+
| group_month | 2013-05 | 2013-06 | 2013-07 | 2013-08 | 2013-09 | 2013-10 |
+-------------+---------+---------+---------+---------+---------+---------+
| 2013-05 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2013-06 | 0 | 1 | 1 | 1 | 2 | 2 |
+-------------+---------+---------+---------+---------+---------+---------+
With crosstab, I could only manage the table below.
a = data.set_index('login_month').groupby('id').resample('M').last().ffill().drop('id', axis=1).reset_index()
pd.crosstab(a.group_month, a.login_month)
Output
login_month 2013-05-31 2013-06-30 2013-07-31 2013-08-31 2013-09-30 2013-10-31
group_month
2013-05-01 1 1 1 0 0 0
2013-06-01 0 1 1 1 2 2
However, we could get the values you need as below.
a = data.set_index('login_month').groupby('id').resample('M').last().ffill().drop('id', axis=1).reset_index()
pd.DataFrame(a[(a['login_month'] >= a['group_month'])].groupby(['group_month', 'login_month'])['id'].count()).unstack().fillna(method='ffill',axis=1).fillna(value=0)
output
login_month 2013-05-31 2013-06-30 2013-07-31 2013-08-31 2013-09-30 2013-10-31
group_month
2013-05-01 1.0 1.0 1.0 1.0 1.0 1.0
2013-06-01 0.0 1.0 1.0 1.0 2.0 2.0
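If you prefer integer counts and 'YYYY-MM' labels instead of floats and month-end timestamps, here is an optional tidy-up sketch of the same expression (assuming the datetime conversions above were assigned back to data):
ret = (a[a['login_month'] >= a['group_month']]
       .groupby(['group_month', 'login_month'])['id'].count()
       .unstack()
       .ffill(axis=1)
       .fillna(0)
       .astype(int))
ret.columns = ret.columns.to_period('M')
ret.index = ret.index.to_period('M')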

Calculate difference in Column B when there is a Match in Column A

I would like the differences in Column B between DF 1 and DF 2, based on string matches in Column A. The data frames have hundreds of rows and may not be in the same order. They look something like this:
df1
A B
0 a,b 2.0
1 d,c 1.4
2 a,c 1.8
3 c,d 5.4
4 m,m 2.0
df2
A B
0 c,d 2.1
1 a,b 2.2
2 k,k 3.0
3 a,d 2.0
4 m,m 1.2
The desired output would be based on DF 1, with NaN where there is no match. It would look like:
DF Result
__| A | B |
0 |'a,b' | -0.2 |
1 |'d,c' | NaN |
2 |'a,c' | NaN |
3 |'c,d' | 3.3 |
4 |'m,m' | 0.8 |
Any help is much appreciated. Thank you!
Perform index-aligned subtraction.
(df1.set_index('A').B - df2.set_index('A').reindex(df1.A).B).reset_index()
A B
0 a,b -0.2
1 d,c NaN
2 a,c NaN
3 c,d 3.3
4 m,m 0.8
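An equivalent sketch using an explicit merge, assuming the values in column A are unique within df2 so the left join does not duplicate rows:
merged = df1.merge(df2, on='A', how='left', suffixes=('_1', '_2'))
result = pd.DataFrame({'A': merged['A'], 'B': merged['B_1'] - merged['B_2']})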

Pandas data frame: adding columns based on previous time periods

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] = [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
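As a side note, the same column can be built without apply, since groupby objects expose shift directly; a sketch, assuming rows are already ordered by period within each person (sort first otherwise):
df['lastPeriod'] = df.groupby('person')['value'].shift()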
