I have a dataset that looks like this:
Categories | Clicks
1 | 1
1 | 3
1 | 2
2 | 2
2 | 1
2 | 1
2 | 2
3 | 1
3 | 2
3 | 3
4 | 2
4 | 1
And to make some bar plots I would like for it to look like this:
Categories | Clicks_count | Clicks_prob
1 | 1 | 33%
2 | 2 | 50%
3 | 1 | 33%
4 | 1 | 50%
So basically: group by Categories, where Clicks_count is the number of times per category that Clicks takes the value 1, and Clicks_prob is the probability of Clicks taking the value 1 (i.e. count of Clicks==1 / count of observations in category i).
How could I do this? To get the second column, I tried:
df.groupby("Categories")["Clicks"].count().reset_index()
but the result is:
Categories | Clicks
1 | 3
2 | 4
3 | 3
4 | 2
Take the sum and mean of the condition Clicks==1. Since you're working with groups, wrap it in a groupby:
df['Clicks'].eq(1).groupby(df['Categories']).agg(['sum','mean'])
Output:
            sum      mean
Categories
1             1  0.333333
2             2  0.500000
3             1  0.333333
4             1  0.500000
To match your desired output's naming, use named aggregation:
df['Clicks'].eq(1).groupby(df['Categories']).agg(Clicks_count='sum', Clicks_prob='mean')
Output:
            Clicks_count  Clicks_prob
Categories
1                      1     0.333333
2                      2     0.500000
3                      1     0.333333
4                      1     0.500000
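If you also want Clicks_prob rendered as a percentage string, as in your desired output, one option is a formatting step on top of the result above (out is just a placeholder name):
out = df['Clicks'].eq(1).groupby(df['Categories']).agg(Clicks_count='sum', Clicks_prob='mean')
# format the fraction as a whole-number percentage, e.g. 0.333333 -> '33%'
out['Clicks_prob'] = (out['Clicks_prob'] * 100).round().astype(int).astype(str) + '%'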
I am trying to add a new column to a dataframe with the apply function. I need to compute the distance between the X and Y coords in row 0 and all other rows. I have created the following logic:
import pandas as pd
import numpy as np
data = {'X':[0,0,0,1,1,5,6,7,8],'Y':[0,1,4,2,6,5,6,4,8],'Value':[6,7,4,5,6,5,6,4,8]}
df = pd.DataFrame(data)
def countDistance(lat1, lon1, lat2, lon2):
    print(lat1, lon1, lat2, lon2)
    # use basic knowledge about triangles - values are in meters
    distance = np.sqrt(np.power(lat1-lat2, 2) + np.power(lon1-lon2, 2))
    return distance

def recModif(df):
    x = df.loc[0,'X']
    y = df.loc[0,'Y']
    df['dist'] = df.apply(lambda n: countDistance(x, y, df['X'], df['Y']), axis=1)
    #more code will come here

recModif(df)
But this always returns an error: ValueError: Wrong number of items passed 9, placement implies
I thought that since x and y are scalars, using np.repeat might help, but it didn't; the error was still the same. I saw similar posts such as this one, but those deal with multiplication, which is simple. How can I achieve the subtraction I need?
The lambda passed to .apply() ignores its row argument and instead reuses the whole df['X'] and df['Y'] columns from the outer scope, so every call returns a full Series rather than a scalar. Use the row argument and the code works:
df['dist'] = df.apply(lambda row: countDistance(x,y,row['X'],row['Y']), axis=1)
df
   X  Y  Value       dist
0  0  0      6   0.000000
1  0  1      7   1.000000
2  0  4      4   4.000000
3  1  2      5   2.236068
4  1  6      6   6.082763
5  5  5      5   7.071068
6  6  6      6   8.485281
7  7  4      4   8.062258
8  8  8      8  11.313708
Also note that np.power() and np.sqrt() are already vectorized, so .apply itself is redundant for the dataset given:
countDistance(x,y,df['X'],df['Y'])
Out[154]:
0     0.000000
1     1.000000
2     4.000000
3     2.236068
4     6.082763
5     7.071068
6     8.485281
7     8.062258
8    11.313708
dtype: float64
To achieve your end goal I suggest changing the function recModif to:
def recModif(df):
    x = df.loc[0,'X']
    y = df.loc[0,'Y']
    df['dist'] = countDistance(x, y, df['X'], df['Y'])
    #more code will come here
This outputs
   X  Y  Value       dist
0  0  0      6   0.000000
1  0  1      7   1.000000
2  0  4      4   4.000000
3  1  2      5   2.236068
4  1  6      6   6.082763
5  5  5      5   7.071068
6  6  6      6   8.485281
7  7  4      4   8.062258
8  8  8      8  11.313708
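If more code will build on this result, a small refinement (a sketch; the ref_idx parameter is an assumption, generalizing the hard-coded row 0) is to return the frame so calls can be chained:
def recModif(df, ref_idx=0):
    # ref_idx is a hypothetical parameter: the row the distances are measured from
    x = df.loc[ref_idx, 'X']
    y = df.loc[ref_idx, 'Y']
    df['dist'] = countDistance(x, y, df['X'], df['Y'])
    return df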
Solution
Try this:
## Method-1
df['dist'] = ((df.X - df.X[0])**2 + (df.Y - df.Y[0])**2)**0.5
## Method-2: .apply()
x, y = df.X[0], df.Y[0]
df['dist'] = df.apply(lambda row: ((row.X - x)**2 + (row.Y - y)**2)**0.5, axis=1)
Output:
# print(df.to_markdown(index=False))
| X | Y | Value | dist |
|----:|----:|--------:|---------:|
| 0 | 0 | 6 | 0 |
| 0 | 1 | 7 | 1 |
| 0 | 4 | 4 | 4 |
| 1 | 2 | 5 | 2.23607 |
| 1 | 6 | 6 | 6.08276 |
| 5 | 5 | 5 | 7.07107 |
| 6 | 6 | 6 | 8.48528 |
| 7 | 4 | 4 | 8.06226 |
| 8 | 8 | 8 | 11.3137 |
Dummy Data
import pandas as pd
data = {
    'X': [0,0,0,1,1,5,6,7,8],
    'Y': [0,1,4,2,6,5,6,4,8],
    'Value': [6,7,4,5,6,5,6,4,8]
}
df = pd.DataFrame(data)
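As a side note, NumPy has a dedicated ufunc for this Euclidean distance, so Method-1 can equivalently be written as:
import numpy as np
# np.hypot(a, b) computes sqrt(a**2 + b**2) element-wise
df['dist'] = np.hypot(df.X - df.X[0], df.Y - df.Y[0])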
I need to get the mean and median for frequencies between two categorical variables. E.g.:
Label | Letter | Num
Foo | A | 1
Foo | B | 2
Foo | C | 4
Bar | A | 2
Bar | G | 3
Bar | N | 1
Bar | P | 2
Cee | B | 1
Cee | B | 2
Cee | C | 4
Cee | D | 5
For instance, what's the mean and median number of letters per label? Here there are 11 cases across three possible labels (M = 3.667), and the median is 4 (3 Foo, 4 Bar, 4 Cee). How can I calculate this in pandas? Is it possible to do this with a groupby statement? My data set is much larger than this.
You need value_counts for one column, or groupby + size (or count if you need to omit NaNs):
a = df['Label'].value_counts()
print (a)
Cee 4
Bar 4
Foo 3
Name: Label, dtype: int64
#alternative
#a = df.groupby('Label').size()
print (a.mean())
3.6666666666666665
print (a.median())
4.0
a = df.groupby(['Label','Letter']).size()
print (a)
Label  Letter
Bar    A      1
       G      1
       N      1
       P      1
Cee    B      2
       C      1
       D      1
Foo    A      1
       B      1
       C      1
dtype: int64
print (a.mean())
1.1
print (a.median())
1.0
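If you want both statistics from a single expression, you can chain .agg onto the counts:
print (df['Label'].value_counts().agg(['mean', 'median']))
# mean 3.666667, median 4.0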
I have a dataset based on different weather stations,
stationID | Time | Temperature | ...
----------+------+-------------+-------
123 | 1 | 30 |
123 | 2 | 31 |
202 | 1 | 24 |
202 | 2 | 24.3 |
202 | 3 | NaN |
...
And I would like to remove 'stationID' groups that have more than a certain number of NaNs. For instance, if I type:
>>> df.groupby('stationID')
then, I would like to drop groups that have (at least) a certain number of NaNs (say 30) within a group. As I understand it, I cannot use dropna(thresh=30) with groupby:
>>> df2.groupby('station').dropna(thresh=30)
AttributeError: Cannot access callable attribute 'dropna' of 'DataFrameGroupBy' objects...
So, what would be the best way to do that with Pandas?
IIUC you can do:
df2.loc[df2.groupby('station')['Temperature'].filter(lambda x: len(x[pd.isnull(x)]) < 30).index]
Example:
In [59]:
df = pd.DataFrame({'id':[0,0,0,1,1,1,2,2,2,2], 'val':[1,1,np.nan,1,np.nan,np.nan, 1,1,1,1]})
df
Out[59]:
   id  val
0   0  1.0
1   0  1.0
2   0  NaN
3   1  1.0
4   1  NaN
5   1  NaN
6   2  1.0
7   2  1.0
8   2  1.0
9   2  1.0
In [64]:
df.loc[df.groupby('id')['val'].filter(lambda x: len(x[pd.isnull(x)]) < 2).index]
Out[64]:
   id  val
0   0  1.0
1   0  1.0
2   0  NaN
6   2  1.0
7   2  1.0
8   2  1.0
9   2  1.0
So this filters out the groups that have more than one NaN value.
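A more direct route to the same rows is DataFrameGroupBy.filter, which keeps every row of each group that passes the condition:
df.groupby('id').filter(lambda g: g['val'].isnull().sum() < 2)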
You can create a column giving the number of null values by stationID, and then use loc to select the relevant data for further processing:
df['station_id_null_count'] = \
    df.groupby('stationID').Temperature.transform(lambda group: group.isnull().sum())
df.loc[df.station_id_null_count <= 30, :]  # keep stations with at most 30 NaNs
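If you'd rather not keep the helper column, the same transform can serve as a boolean mask directly (a sketch of the same idea):
mask = df.groupby('stationID')['Temperature'].transform(lambda g: g.isnull().sum()) <= 30
df[mask]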
Using @EdChum's setup. Since you don't mention your final output, adding this:
vals = df.groupby(['id'])['val'].apply(lambda x: (np.size(x) - x.count()) < 2)
vals[vals]
id
0 True
2 True
Name: val, dtype: bool
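If the final output should be the surviving rows themselves, this boolean Series can be turned into a row filter, for example:
df[df['id'].isin(vals[vals].index)]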
I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] = [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6      0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
  person  period  value  lastPeriod
1    P22       1      0         NaN
2    P23       1      0         NaN
3    P24       1      1         NaN
4    P25       1      0         NaN
5    P26       1      1         NaN
6    P22       2      1           0
Here the NaNs signify missing data (i.e. there was no entry for the previous period).
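Note that groupby objects support shift directly, so the lambda isn't strictly needed:
df['lastPeriod'] = df.groupby('person')['value'].shift()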