I have a dataset that looks like this:
Categories | Clicks
1 | 1
1 | 3
1 | 2
2 | 2
2 | 1
2 | 1
2 | 2
3 | 1
3 | 2
3 | 3
4 | 2
4 | 1
And to make some bar plots I would like for it to look like this:
Categories | Clicks_count | Clicks_prob
1 | 1 | 33%
2 | 2 | 50%
3 | 1 | 33%
4 | 1 | 50%
So basically: group by Categories, where Clicks_count is the number of times per category that Clicks takes the value 1, and Clicks_prob is the probability of Clicks taking the value 1 (i.e. count of Clicks==1 / count of observations in category i).
How could I do this? To get the second column, I tried:
df.groupby("Categories")["Clicks"].count().reset_index()
but the result is:
Categories | Clicks
1 | 3
2 | 4
3 | 3
4 | 2
Take the sum and mean of the condition Clicks==1. Since you're working with groups, wrap it in a groupby:
df['Clicks'].eq(1).groupby(df['Categories']).agg(['sum','mean'])
Output:
            sum      mean
Categories
1             1  0.333333
2             2  0.500000
3             1  0.333333
4             1  0.500000
To match your desired output's naming, use named aggregation:
df['Clicks'].eq(1).groupby(df['Categories']).agg(Clicks_count='sum', Clicks_prob='mean')
Output:
            Clicks_count  Clicks_prob
Categories
1                      1     0.333333
2                      2     0.500000
3                      1     0.333333
4                      1     0.500000
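If you also want Clicks_prob rendered as a percentage string, as in your desired output, one option is a formatting step on top of the result above (out is just a placeholder name):
out = df['Clicks'].eq(1).groupby(df['Categories']).agg(Clicks_count='sum', Clicks_prob='mean')
# format the fraction as a whole-number percentage, e.g. 0.333333 -> '33%'
out['Clicks_prob'] = (out['Clicks_prob'] * 100).round().astype(int).astype(str) + '%'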
I am trying to add a new column to a dataframe with the apply function. I need to compute the distance between the X and Y coords in row 0 and all other rows. I have created the following logic:
import pandas as pd
import numpy as np
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
def countDistance(lat1, lon1, lat2, lon2):
    print(lat1, lon1, lat2, lon2)
    # use basic knowledge about triangles - values are in meters
    distance = np.sqrt(np.power(lat1-lat2, 2) + np.power(lon1-lon2, 2))
    return distance

def recModif(df):
    x = df.loc[0,'X']
    y = df.loc[0,'Y']
    df['dist'] = df.apply(lambda n: countDistance(x, y, df['X'], df['Y']), axis=1)
    #more code will come here

recModif(df)
But this always returns an error: ValueError: Wrong number of items passed 9, placement implies
I thought that since x and y are scalars, using np.repeat might help, but it didn't; the error was still the same. I saw similar posts such as this one, but those deal with multiplication, which is simple. How can I achieve the subtraction I need?
The lambda passed to .apply() ignores its row argument and instead reuses the whole df['X'] and df['Y'] columns from the outer scope, so every call returns a full Series rather than a scalar. Use the row argument and the code works:
df['dist'] = df.apply(lambda row: countDistance(x,y,row['X'],row['Y']), axis=1)
df
   X  Y  Value       dist
0  0  0      6   0.000000
1  0  1      7   1.000000
2  0  4      4   4.000000
3  1  2      5   2.236068
4  1  6      6   6.082763
5  5  5      5   7.071068
6  6  6      6   8.485281
7  7  4      4   8.062258
8  8  8      8  11.313708
Also note that np.power() and np.sqrt() are already vectorized, so .apply itself is redundant for the dataset given:
countDistance(x,y,df['X'],df['Y'])
Out[154]:
0     0.000000
1     1.000000
2     4.000000
3     2.236068
4     6.082763
5     7.071068
6     8.485281
7     8.062258
8    11.313708
dtype: float64
To achieve your end goal I suggest changing the function recModif to:
def recModif(df):
    x = df.loc[0,'X']
    y = df.loc[0,'Y']
    df['dist'] = countDistance(x, y, df['X'], df['Y'])
    #more code will come here
This outputs
   X  Y  Value       dist
0  0  0      6   0.000000
1  0  1      7   1.000000
2  0  4      4   4.000000
3  1  2      5   2.236068
4  1  6      6   6.082763
5  5  5      5   7.071068
6  6  6      6   8.485281
7  7  4      4   8.062258
8  8  8      8  11.313708
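If more code will build on this result, a small refinement (a sketch; the ref_idx parameter is an assumption, generalizing the hard-coded row 0) is to return the frame so calls can be chained:
def recModif(df, ref_idx=0):
    # ref_idx is a hypothetical parameter: the row the distances are measured from
    x = df.loc[ref_idx, 'X']
    y = df.loc[ref_idx, 'Y']
    df['dist'] = countDistance(x, y, df['X'], df['Y'])
    return df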
Solution
Try this:
## Method-1
df['dist'] = ((df.X - df.X[0])**2 + (df.Y - df.Y[0])**2)**0.5
## Method-2: .apply()
x, y = df.X[0], df.Y[0]
df['dist'] = df.apply(lambda row: ((row.X - x)**2 + (row.Y - y)**2)**0.5, axis=1)
Output:
# print(df.to_markdown(index=False))
| X | Y | Value | dist |
|----:|----:|--------:|---------:|
| 0 | 0 | 6 | 0 |
| 0 | 1 | 7 | 1 |
| 0 | 4 | 4 | 4 |
| 1 | 2 | 5 | 2.23607 |
| 1 | 6 | 6 | 6.08276 |
| 5 | 5 | 5 | 7.07107 |
| 6 | 6 | 6 | 8.48528 |
| 7 | 4 | 4 | 8.06226 |
| 8 | 8 | 8 | 11.3137 |
Dummy Data
import pandas as pd
data = {
    'X': [0,0,0,1,1,5,6,7,8],
    'Y': [0,1,4,2,6,5,6,4,8],
    'Value': [6,7,4,5,6,5,6,4,8]
}
df = pd.DataFrame(data)
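As a side note, NumPy has a dedicated ufunc for this Euclidean distance, so Method-1 can equivalently be written as:
import numpy as np
# np.hypot(a, b) computes sqrt(a**2 + b**2) element-wise
df['dist'] = np.hypot(df.X - df.X[0], df.Y - df.Y[0])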
I need to get the mean and median for frequencies between two categorical variables. E.g.:
Label | Letter | Num
Foo | A | 1
Foo | B | 2
Foo | C | 4
Bar | A | 2
Bar | G | 3
Bar | N | 1
Bar | P | 2
Cee | B | 1
Cee | B | 2
Cee | C | 4
Cee | D | 5
For instance, what's the mean and median number of letters per label? Here there are 11 cases across three possible labels (M = 3.667), and the median is 4 (3 Foo, 4 Bar, 4 Cee). How can I calculate this in pandas? Is it possible to do this with a groupby statement? My data set is much larger than this.
You need value_counts for one column, or groupby + size (or count if you need to omit NaNs):
a = df['Label'].value_counts()
print (a)
Cee 4
Bar 4
Foo 3
Name: Label, dtype: int64
#alternative
#a = df.groupby('Label').size()
print (a.mean())
3.6666666666666665
print (a.median())
4.0
a = df.groupby(['Label','Letter']).size()
print (a)
Label  Letter
Bar    A      1
       G      1
       N      1
       P      1
Cee    B      2
       C      1
       D      1
Foo    A      1
       B      1
       C      1
dtype: int64
print (a.mean())
1.1
print (a.median())
1.0
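If you want both statistics from a single expression, you can chain .agg onto the counts:
print (df['Label'].value_counts().agg(['mean', 'median']))
# mean 3.666667, median 4.0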
I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove 'stationID' groups that have more than a certain number of NaNs. For instance, if I type:
>>> df.groupby('stationID')
then, I would like to drop groups that have (at least) a certain number of NaNs (say 30) within a group. As I understand it, I cannot use dropna(thresh=30) with groupby:
>>> df2.groupby('station').dropna(thresh=30)
AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...
So, what would be the best way to do that with Pandas?
IIUC you can do:
df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)]) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
   id  val
0   0  1.0
1   0  1.0
2   0  NaN
3   1  1.0
4   1  NaN
5   1  NaN
6   2  1.0
7   2  1.0
8   2  1.0
9   2  1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)]) < 2).index]
Out[64]:
   id  val
0   0  1.0
1   0  1.0
2   0  NaN
6   2  1.0
7   2  1.0
8   2  1.0
9   2  1.0
So this filters out the groups that have more than one NaN value.
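A more direct route to the same rows is DataFrameGroupBy.filter, which keeps every row of each group that passes the condition:
df.groupby('id').filter(lambda g: g['val'].isnull().sum() < 2)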
You can create a column giving the number of null values by stationID, and then use loc to select the relevant data for further processing:
df['station_id_null_count'] = \
    df.groupby('stationID').Temperature.transform(lambda group: group.isnull().sum())
df.loc[df.station_id_null_count <= 30, :]  # keep stations with at most 30 NaNs
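If you'd rather not keep the helper column, the same transform can serve as a boolean mask directly (a sketch of the same idea):
mask = df.groupby('stationID')['Temperature'].transform(lambda g: g.isnull().sum()) <= 30
df[mask]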
Using @EdChum's setup. Since you don't mention your final output, adding this:
vals = df.groupby(['id'])['val'].apply(lambda x: (np.size(x) - x.count()) < 2)
vals[vals]
id
0 True
2 True
Name: val, dtype: bool
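If the final output should be the surviving rows themselves, this boolean Series can be turned into a row filter, for example:
df[df['id'].isin(vals[vals].index)]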
I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] = [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6      0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
  person  period  value  lastPeriod
1    P22       1      0         NaN
2    P23       1      0         NaN
3    P24       1      1         NaN
4    P25       1      0         NaN
5    P26       1      1         NaN
6    P22       2      1           0
Here the NaNs signify missing data (i.e. there was no entry for the previous period).
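Note that groupby objects support shift directly, so the lambda isn't strictly needed:
df['lastPeriod'] = df.groupby('person')['value'].shift()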