Descriptive stats for two categorical variables (pandas)

I need to get the mean and median for frequencies between two categorical variables. E.g.:
Label | Letter | Num
Foo | A | 1
Foo | B | 2
Foo | C | 4
Bar | A | 2
Bar | G | 3
Bar | N | 1
Bar | P | 2
Cee | B | 1
Cee | B | 2
Cee | C | 4
Cee | D | 5
For instance, what's the mean and median number of letters per label? Here there are 11 cases across three possible labels, so the mean is 3.667, and the median is 4 (3 Foo, 4 Bar, 4 Cee). How can I calculate this in pandas? Is it possible to do this with a groupby statement? My data set is much larger than this.

Use value_counts on one column, or groupby + size (or groupby + count if you need to omit NaNs):
a = df['Label'].value_counts()
print (a)
Cee 4
Bar 4
Foo 3
Name: Label, dtype: int64
#alternative
#a = df.groupby('Label').size()
print (a.mean())
3.6666666666666665
print (a.median())
4.0
a = df.groupby(['Label','Letter']).size()
print (a)
Label Letter
Bar A 1
G 1
N 1
P 1
Cee B 2
C 1
D 1
Foo A 1
B 1
C 1
dtype: int64
print (a.mean())
1.1
print (a.median())
1.0
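As a self-contained sketch of both computations, assuming the sample data above (if repeated letters within a label should only count once, swap the row count for nunique, as in the last line):
import pandas as pd

df = pd.DataFrame({
    'Label':  ['Foo', 'Foo', 'Foo', 'Bar', 'Bar', 'Bar', 'Bar',
               'Cee', 'Cee', 'Cee', 'Cee'],
    'Letter': ['A', 'B', 'C', 'A', 'G', 'N', 'P', 'B', 'B', 'C', 'D'],
    'Num':    [1, 2, 4, 2, 3, 1, 2, 1, 2, 4, 5],
})

counts = df.groupby('Label').size()    # rows per label: Bar 4, Cee 4, Foo 3
print(counts.mean())                   # 3.6666666666666665
print(counts.median())                 # 4.0

# Cee lists B twice; use nunique if duplicate letters should count once
print(df.groupby('Label')['Letter'].nunique().mean())  # 3.333...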

Count occurrences of specific value in column based on categories of another column

I have a dataset that looks like this:
Categories | Clicks
1 | 1
1 | 3
1 | 2
2 | 2
2 | 1
2 | 1
2 | 2
3 | 1
3 | 2
3 | 3
4 | 2
4 | 1
And to make some bar plots I would like for it to look like this:
Categories | Clicks_count | Clicks_prob
1 | 1 | 33%
2 | 2 | 50%
3 | 1 | 33%
4 | 1 | 50%
So basically: group by Categories, and calculate in Clicks_count the number of times per category that Clicks takes the value 1, and in Clicks_prob the probability of Clicks taking the value 1 (i.e. the count of Clicks == 1 divided by the number of observations in category i).
How could I do this? To get the second column I tried:
df.groupby("Categories")["Clicks"].count().reset_index()
but the result is:
Categories | Clicks
1 | 3
2 | 4
3 | 3
4 | 2
Take sum and mean of the boolean condition Clicks == 1. Since you're working with groups, compute them within a groupby:
df['Clicks'].eq(1).groupby(df['Categories']).agg(['sum','mean'])
Output:
sum mean
Categories
1 1 0.333333
2 2 0.500000
3 1 0.333333
4 1 0.500000
To match the desired output's naming, use named aggregation:
df['Clicks'].eq(1).groupby(df['Categories']).agg(Clicks_count='sum', Clicks_prob='mean')
Output:
Clicks_count Clicks_prob
Categories
1 1 0.333333
2 2 0.500000
3 1 0.333333
4 1 0.500000
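As a self-contained sketch (the frame reconstructs the sample data; the final line is just one plausible way to get the bar plot the question mentions, and needs matplotlib installed):
import pandas as pd

df = pd.DataFrame({
    'Categories': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4],
    'Clicks':     [1, 3, 2, 2, 1, 1, 2, 1, 2, 3, 2, 1],
})

out = (df['Clicks'].eq(1)
         .groupby(df['Categories'])
         .agg(Clicks_count='sum', Clicks_prob='mean')
         .reset_index())

# bar plot of the per-category probability of Clicks == 1
out.plot.bar(x='Categories', y='Clicks_prob')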

Ranking columns by a single integer

Hi, I am basically trying to rank a column in a dataframe into ranking positions.
It looks something like the table below. I want people with the same number of fruits sold to have the same ranking, so that when I sort by ranking there are no decimals. Can anyone advise me?
person | number of fruits sold | ranking
A | 5 | 2
B | 6 | 1
C | 2 | 4
D | 5 | 2
E | 3 | 3
You can use pd.factorize. A few tricks here: negate your series so larger values rank first, specify sort=True, and add 1 to get the desired result.
df['ranking'] = pd.factorize(-df['number of fruits sold'], sort=True)[0] + 1
Result:
person number of fruits sold ranking
0 A 5 2
1 B 6 1
2 C 2 4
3 D 5 2
4 E 3 3
Use Series.rank:
df['ranking'] = df['number of fruits sold'].rank(method='dense', ascending=False).astype(int)
print (df)
person number of fruits sold ranking
0 A 5 2
1 B 6 1
2 C 2 4
3 D 5 2
4 E 3 3
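For reference, the decimals the question mentions come from rank's default method='average', which gives ties the mean of their positions; a short sketch contrasting the tie-handling methods on the same numbers:
import pandas as pd

s = pd.Series([5, 6, 2, 5, 3])
print(s.rank(ascending=False).tolist())                  # [2.5, 1.0, 5.0, 2.5, 4.0] (default 'average')
print(s.rank(method='min', ascending=False).tolist())    # [2.0, 1.0, 5.0, 2.0, 4.0]
print(s.rank(method='dense', ascending=False).tolist())  # [2.0, 1.0, 4.0, 2.0, 3.0]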

How to replace NaNs with valid value from within Pandas group

I want to replace NaNs in a Pandas DataFrame column with non-NaN values from within the same group. In my case these are geo coordinates where for some data points the lookup failed. Grouped by 'place' (df.groupby('place')), the data looks like:
place | lat | lng
-----------------
foo | NaN | NaN
foo | 1 | 4
foo | 1 | 4
foo | NaN | NaN
bar | 5 | 7
bar | 5 | 7
bar | NaN | NaN
bar | NaN | NaN
bar | 5 | 7
==> what I want:
foo | 1 | 4
foo | 1 | 4
foo | 1 | 4
foo | 1 | 4
bar | 5 | 7
bar | 5 | 7
bar | 5 | 7
bar | 5 | 7
bar | 5 | 7
In my case the lat/lng values within the same 'place' grouping are constant, so picking any non-NaN value would work. I'm also curious how I could do a fill with e.g. mean/majority count.
Use groupby along with ffill and bfill:
df[['lat', 'lng']] = df.groupby('place').ffill().bfill()
df:
place lat lng
0 foo 1 4
1 foo 1 4
2 foo 1 4
3 foo 1 4
4 bar 5 7
5 bar 5 7
6 bar 5 7
7 bar 5 7
8 bar 5 7
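Note that the trailing .bfill() above runs on the whole frame, so a group made up entirely of NaNs could pick up values from a neighbouring group. A variant that keeps both fills inside each group (a sketch; not needed for the data shown):
df[['lat', 'lng']] = (df.groupby('place')[['lat', 'lng']]
                        .transform(lambda s: s.ffill().bfill()))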
If the values within each group are constant, the following should also work; note it fills across group boundaries, so it relies on each run of NaNs having a valid neighbouring row:
df = df.ffill().bfill()
Or fill NaN with the first valid value in each group:
df.fillna(df.groupby('place').transform('first'))
place lat lng
0 foo 1.0 4.0
1 foo 1.0 4.0
2 foo 1.0 4.0
3 foo 1.0 4.0
4 bar 5.0 7.0
5 bar 5.0 7.0
6 bar 5.0 7.0
7 bar 5.0 7.0
8 bar 5.0 7.0
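For the mean / majority-count fills the question also asks about, a sketch using per-group transform (the mode line assumes every group has at least one non-NaN value, otherwise .iloc[0] would fail):
# fill with the per-group mean
df_mean = df.fillna(df.groupby('place').transform('mean'))

# fill with the per-group majority value; mode() can return several
# values, so take the first
df_mode = df.fillna(df.groupby('place').transform(lambda s: s.mode().iloc[0]))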

How to identify a specific occurrence across two rows and calculate the count

Let's say I have this pandas dataframe:
id | userid | type
1 | 20 | a
2 | 20 | a
3 | 20 | b
4 | 21 | a
5 | 21 | b
6 | 21 | a
7 | 21 | b
8 | 21 | b
I want to obtain the number of times 'b follows a' for each user, and obtain a new dataframe like this:
userid | b_follows_a
20 | 1
21 | 2
I know I can do this using for loop. However, I wonder if there is a more elegant solution to this.
You can use shift(-1) to check whether an 'a' is followed by a 'b' with a vectorized &, then count the Trues with a sum:
df.groupby('userid').type.apply(lambda x: ((x == "a") & (x.shift(-1) == "b")).sum()).reset_index()
#userid type
#0 20 1
#1 21 2
Creative solution:
In [49]: df.groupby('userid')['type'].sum().str.count('ab').reset_index()
Out[49]:
userid type
0 20 1
1 21 2
Explanation:
In [50]: df.groupby('userid')['type'].sum()
Out[50]:
userid
20 aab
21 ababb
Name: type, dtype: object
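An equivalent, more explicit spelling joins the strings per group instead of relying on sum, and names the result column as in the desired output (a self-contained sketch):
import pandas as pd

df = pd.DataFrame({
    'userid': [20, 20, 20, 21, 21, 21, 21, 21],
    'type':   ['a', 'a', 'b', 'a', 'b', 'a', 'b', 'b'],
})

out = (df.groupby('userid')['type']
         .agg(''.join)                 # 'aab', 'ababb'
         .str.count('ab')
         .reset_index(name='b_follows_a'))
print(out)
#   userid  b_follows_a
# 0     20            1
# 1     21            2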

Pandas data frame: adding columns based on previous time periods

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head(), i.e. the top few rows, looks like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] = [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    0.0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
0 P22 1 0 NaN
1 P23 1 0 NaN
2 P24 1 1 NaN
3 P25 1 0 NaN
4 P26 1 1 NaN
5 P22 2 1 0.0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
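The lambda can also be dropped, since groupby objects support shift directly (this assumes rows are already ordered by period within each person, as in the sample):
import pandas as pd

df = pd.DataFrame({
    'person': ['P22', 'P23', 'P24', 'P25', 'P26', 'P22'],
    'period': [1, 1, 1, 1, 1, 2],
    'value':  [0, 0, 1, 0, 1, 1],
})

df['lastPeriod'] = df.groupby('person')['value'].shift()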
